
@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 19, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Preview | Comments | Updated (UTC)
rivetkit-serverless | Error | Error | | Nov 20, 2025 8:52pm

3 Skipped Deployments

Project | Deployment | Preview | Comments | Updated (UTC)
rivet-cloud | Ignored | Ignored | | Nov 20, 2025 8:52pm
rivet-inspector | Ignored | Ignored | Preview | Nov 20, 2025 8:52pm
rivet-site | Ignored | Ignored | Preview | Nov 20, 2025 8:52pm

Contributor Author

MasterPtato commented Nov 19, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry changed the base branch from 11-19-chore_pegboard-gateway_remove_tunnelack_ to graphite-base/3491 November 19, 2025 21:21
@claude

claude bot commented Nov 19, 2025

Pull Request Review: Implement Ping/Pong for Tunnel

Summary

This PR implements a ping/pong mechanism for the pegboard tunnel to monitor connection health and measure RTT between gateway and runner. Overall, this is a solid implementation that addresses tunnel health monitoring.


✅ Strengths

  1. Good Architecture: The refactoring into separate task modules (ping_task.rs, tunnel_to_ws_task.rs, ws_to_tunnel_task.rs) significantly improves code organization and readability.

  2. Proper Protocol Design: The ping/pong protocol is well-designed:

    • Gateway sends ToRunnerPing with timestamp
    • Runner responds with ToGatewayPong echoing the timestamp
    • Gateway calculates RTT and tracks last_pong for timeout detection
  3. Metrics Integration: Good use of OpenTelemetry metrics to track RTT (rivet_gateway_tunnel_ping_duration) for observability.

  4. Timeout Handling: The TUNNEL_PING_TIMEOUT (30s) with UPDATE_PING_INTERVAL (3s) provides reasonable connection monitoring with multiple ping attempts before timeout.


🔍 Issues & Suggestions

1. Potential Race Condition in Ping Timeout Check ⚠️

Location: engine/packages/pegboard-gateway/src/shared_state.rs:237-240

if now.saturating_sub(req.last_pong) > TUNNEL_PING_TIMEOUT {
    tracing::warn!("tunnel timeout");
    return Err(WebSocketServiceTimeout.build());
}

Issue: The ping task checks for timeout BEFORE sending the ping, but last_pong is only updated when a pong is received. On the very first ping after connection establishment, last_pong is initialized to util::timestamp::now() (shared_state.rs:119), which is fine. However, there's a subtle timing issue:

  • If the runner is slow to respond for exactly TUNNEL_PING_TIMEOUT, the next ping check will immediately fail
  • The timeout check happens before sending a new ping, so we don't give the current ping a chance to respond

Suggestion: Consider checking the timeout AFTER a failed ping response or implementing a separate "last ping sent" timestamp to track when we last attempted communication.
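
A minimal sketch of the "last ping sent" option, assuming hypothetical field names (the real struct in shared_state.rs differs):

struct InFlightRequest {
    last_pong: i64,      // ms timestamp of the newest pong received
    last_ping_sent: i64, // ms timestamp of the newest ping sent
}

impl InFlightRequest {
    // Declare a timeout only once a full window of pings has gone
    // unanswered, rather than comparing `now` against `last_pong` alone.
    fn is_timed_out(&self, now: i64, timeout_ms: i64) -> bool {
        self.last_ping_sent.saturating_sub(self.last_pong) > timeout_ms
            && now.saturating_sub(self.last_pong) > timeout_ms
    }
}

fn main() {
    let req = InFlightRequest { last_pong: 0, last_ping_sent: 31_000 };
    assert!(req.is_timed_out(31_500, 30_000)); // 31s of unanswered pings
}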

2. Inconsistent RTT Calculation Between Gateway and Runner

Location: engine/packages/pegboard-runner/src/ws_to_tunnel_task.rs:94-109

The runner calculates RTT as:

let delta = now.saturating_sub(ping.ts);
let rtt = delta * 2;  // Assuming symmetric delta

But the gateway calculates it as:

let rtt = now.saturating_sub(pong.ts);  // shared_state.rs:285

Issue: These measure different things:

  • Runner: Measures one-way latency × 2 (from gateway → runner)
  • Gateway: Measures actual round-trip time (gateway → runner → gateway)

The gateway's calculation is the true RTT, while the runner's is an estimate assuming symmetric latency. This might cause confusion when debugging or monitoring.

Suggestion: Add a comment explaining this difference, or consider having the runner echo back both timestamps so the gateway can calculate true RTT on both sides.

3. Missing Error Context in Pong Handler

Location: engine/packages/pegboard-gateway/src/shared_state.rs:271-282

Ok(protocol::ToGateway::ToGatewayPong(pong)) => {
    let Some(mut in_flight) = self.in_flight_requests.get_async(&pong.request_id).await
    else {
        tracing::debug!(
            request_id=?Uuid::from_bytes(pong.request_id),
            "in flight has already been disconnected, dropping ping"
        );
        continue;
    };

Minor Issue: This silently drops pongs for disconnected requests. While this is probably fine, it could indicate a timing issue or bug if it happens frequently.

Suggestion: Consider adding a metric to track dropped pongs to help identify potential issues in production.
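
A sketch of such a counter, following the lazy_static METER pattern shown in the metrics.rs excerpt below; the metric name is hypothetical, and the final builder call varies by opentelemetry version (.build() on recent releases, .init() on older ones):

use opentelemetry::metrics::Counter;

lazy_static::lazy_static! {
    pub static ref TUNNEL_DROPPED_PONGS: Counter<u64> = METER
        .u64_counter("rivet_gateway_tunnel_dropped_pongs")
        .with_description("Pongs received for requests no longer in flight.")
        .build();
}

// In the pong handler, at the point where the in-flight lookup fails:
// metrics::TUNNEL_DROPPED_PONGS.add(1, &[]);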

4. Ping Task Continues After Error

Location: engine/packages/pegboard-gateway/src/ping_task.rs:21

shared_state.send_and_check_ping(request_id).await?;

The ping task will exit immediately if send_and_check_ping returns an error (including timeout). This is correct behavior, but ensure that the error propagates properly to trigger connection cleanup.

Verification Needed: Confirm that when the ping task returns an error, it properly triggers cleanup of the other tasks and the connection. From the code in lib.rs:446-467, it looks like this is handled correctly via the abort channels.

5. Clock Skew Handling

Location: engine/packages/pegboard-runner/src/ws_to_tunnel_task.rs:94-106

if ping.ts <= now {
    let delta_ms = now.saturating_sub(ping.ts);
    delta_ms.min(u32::MAX as i64) as u32
} else {
    tracing::warn!("ping timestamp is in the future, possibly due to clock skew");
    0
}

Good: The code handles clock skew gracefully by defaulting to 0 RTT when timestamps are in the future.

Suggestion: Consider setting a flag or metric when clock skew is detected, as this could indicate a systemic issue that needs attention.

6. Metrics Bucket Configuration

Location: engine/packages/pegboard-gateway/src/metrics.rs:10-12

pub static ref TUNNEL_PING_DURATION: Histogram<f64> = METER.f64_histogram("rivet_gateway_tunnel_ping_duration")
    .with_description("RTT of messages from gateway to pegboard.")
    .with_boundaries(BUCKETS.to_vec())

Question: What are the bucket boundaries defined in BUCKETS? For RTT measurements, typical values might range from 1ms to several seconds. Ensure the buckets are appropriate for the expected RTT range.

7. Missing Tracing Instrumentation

Location: engine/packages/pegboard-gateway/src/ping_task.rs

The ping task function is not instrumented with #[tracing::instrument], unlike other tasks.

Suggestion: Add instrumentation for consistency:

#[tracing::instrument(skip_all, fields(request_id=?Uuid::from_bytes(request_id)))]
pub async fn task(...)

🧪 Testing Concerns

No Test Coverage: This PR doesn't include any tests for the new ping/pong functionality. Consider adding:

  1. Unit tests for timeout calculations
  2. Integration tests for ping/pong message flow
  3. Tests for edge cases like clock skew, dropped pongs, and timeout scenarios

Per CLAUDE.md guidelines, when running tests: "pipe the test to a file in /tmp/ then grep it in a second step."


🔒 Security

No security concerns identified. The ping/pong mechanism doesn't introduce new attack vectors.


⚡ Performance

Minor Concern: The ping task runs every 3 seconds for every active WebSocket connection. For high connection counts, this could generate significant background traffic.

Metrics: The ping interval and timeout values seem reasonable:

  • UPDATE_PING_INTERVAL: 3 seconds
  • TUNNEL_PING_TIMEOUT: 30 seconds (allows ~10 ping attempts before timeout)

Consider monitoring the overhead in production with many concurrent connections.


📝 Code Quality

Good:

  • Follows existing error handling patterns using RivetError
  • Proper use of structured logging with tracing
  • Clean separation of concerns with dedicated task modules
  • Follows workspace dependency conventions

Minor Issues:

  • The ordering of serde and serde_json in Cargo.toml was changed (alphabetization) - this is fine but inconsistent with the note to "check existing imports before adding new ones"
  • Some #[allow(dead_code)] attributes added to struct fields in shared_state.rs:454-461 - consider if these fields are actually needed

✨ Suggestions for Future Work

  1. Add configurable ping interval and timeout via environment variables
  2. Consider adaptive ping intervals based on connection stability
  3. Add more detailed metrics (e.g., histogram of pong response times, dropped pong counter)
  4. Consider implementing exponential backoff for ping attempts before timeout

Verdict

This is a solid implementation with good architectural decisions. The main concerns are:

  1. The potential race condition in timeout checking (minor)
  2. Lack of test coverage (should be addressed)
  3. Missing observability for edge cases (dropped pongs, clock skew)

Recommendation: ✅ Approve with suggestions - The code is production-ready, but would benefit from additional tests and minor improvements to error handling/observability.


Review conducted following repository conventions from CLAUDE.md

@NathanFlurry NathanFlurry force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from 11b75bb to 73e19ae Compare November 19, 2025 21:42
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3491 to 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack November 19, 2025 21:42
@NathanFlurry NathanFlurry marked this pull request as ready for review November 19, 2025 21:42
@claude

claude bot commented Nov 19, 2025

Code Review: Ping-Pong Implementation for Tunnel Protocol

This PR implements a ping-pong mechanism to detect and handle tunnel timeouts. Overall, the implementation is solid with good separation of concerns.

Strengths

  1. Good Architecture: The code properly separates ping logic into dedicated task modules on both gateway and runner sides
  2. Metrics Integration: Added proper OpenTelemetry metrics (TUNNEL_PING_DURATION) to track RTT
  3. Clean Refactoring: The main websocket handler logic was successfully extracted into separate task modules, improving code organization
  4. Proper Tracing: Added tracing::instrument attributes to key functions for better observability
  5. Structured Logging: Follows the repo's logging conventions with structured fields

Potential Issues

1. Ping Timeout Timing Mismatch (Critical)

The ping is sent every 3 seconds (UPDATE_PING_INTERVAL), but the timeout is 30 seconds (TUNNEL_PING_TIMEOUT). This means it would take 10 missed pings before timing out. Consider:

  • Is this intentional for network resilience?
  • Should there be a more aggressive timeout (e.g., 3-4x the ping interval = 9-12 seconds)?

2. Initial Ping Timing

In pegboard-gateway/src/ping_task.rs:13-21, the first ping is sent after waiting for UPDATE_PING_INTERVAL. This delays initial connection validation by 3 seconds. Consider sending the first ping immediately before entering the loop.
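
A sketch of that reordering under assumed names (the real task also takes shared state and a request ID, and returns LifecycleResult):

use std::time::Duration;

const UPDATE_PING_INTERVAL: Duration = Duration::from_secs(3);

async fn ping_loop(mut abort_rx: tokio::sync::watch::Receiver<()>) -> anyhow::Result<()> {
    loop {
        // Ping first, so a dead connection is detected on the first
        // iteration instead of only after UPDATE_PING_INTERVAL.
        send_and_check_ping().await?;

        tokio::select! {
            _ = tokio::time::sleep(UPDATE_PING_INTERVAL) => {}
            _ = abort_rx.changed() => return Ok(()),
        }
    }
}

// Placeholder for shared_state.send_and_check_ping(request_id).
async fn send_and_check_ping() -> anyhow::Result<()> { Ok(()) }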

3. RTT Calculation Asymmetry

In pegboard-runner/src/ws_to_tunnel_task.rs:109-110, the code assumes symmetric network latency (rtt = delta * 2). While common, this may be inaccurate in asymmetric network conditions. The comment is good, but this limitation should be noted.

4. Error Handling in Ping Task

If send_and_check_ping returns an error, the ping task terminates without specific logging about why. Consider adding error context before returning.

5. Potential Race Condition

In shared_state.rs:266-267, there's a potential race between the ping task reading last_pong and the pong handler updating it. However, this is likely acceptable since the check is conservative (only leads to false negatives).

Code Quality Observations

1. Duplicate Abort Logic

The abort signal handling is duplicated across three join arms (lib.rs:435-473). Consider extracting to a helper function.

2. Magic Number in Metrics

The 0.001 constant in shared_state.rs:270 converts milliseconds to seconds. Consider adding a named constant for clarity.

Test Coverage

Missing: No tests were added for the new ping-pong functionality. Consider adding:

  • Unit tests for timeout detection logic
  • Integration tests for ping-pong flow
  • Tests for edge cases (clock skew, missed pongs, timeout scenarios)

Security & Performance

Security: No concerns identified. Proper timestamp validation and clock skew handling.

Performance: 3-second ping intervals are reasonable. Minimal metrics overhead. Good use of concurrent tasks.

Summary

Well-structured implementation that adds necessary timeout detection. Main concerns:

  • Action Required: Verify ping timeout values are intentional
  • Recommended: Send first ping immediately, add tests
  • Nice-to-Have: Improve error logging, reduce code duplication

The code follows repository conventions well and demonstrates good async Rust patterns.

@claude

claude bot commented Nov 19, 2025

Pull Request Review: Implement Ping-Pong for Tunnel

Overall Assessment

This is a well-structured implementation that adds a critical ping-pong mechanism to detect tunnel connection failures. The refactoring to separate tasks improves code organization and maintainability.


Strengths

  1. Good Code Organization: The refactoring into separate task modules significantly improves code readability and maintainability.
  2. Proper Timeout Handling: The implementation includes timeout detection with TUNNEL_PING_TIMEOUT (30s) to identify stale connections.
  3. Metrics Integration: The addition of TUNNEL_PING_DURATION metric provides valuable observability for tunnel health monitoring.
  4. Structured Logging: Good use of structured logging with tracing attributes throughout the code.

Issues and Concerns

1. Potential Race Condition in Ping Timeout Check (High Priority)

Location: engine/packages/pegboard-gateway/src/shared_state.rs:221

The ping task sends a ping every 3 seconds and immediately checks whether the last pong arrived more than 30 seconds ago. However, the timeout check happens AFTER the ping message is sent, which could lead to race conditions with message reordering or delivery delays.

Recommendation: Consider checking the timeout BEFORE sending the ping, not after. This makes the logic clearer.


2. Ping Task Doesn't Send First Ping Immediately (Medium Priority)

Location: engine/packages/pegboard-gateway/src/ping_task.rs:14-21

The ping task sleeps for 3 seconds before sending the first ping. If the connection is already dead, it takes 30 seconds to detect rather than 3 seconds.

Recommendation: Consider sending the first ping immediately before entering the sleep loop.


3. RTT Calculation Assumes Symmetric Latency (Low Priority)

Location: engine/packages/pegboard-runner/src/ws_to_tunnel_task.rs:109-110

The code multiplies delta by 2 assuming symmetric latency, which may not hold in real-world networks.

Recommendation: Document this assumption more prominently or use one-way latency directly.


4. Missing Error Context in Metrics (Low Priority)

Location: engine/packages/pegboard-gateway/src/shared_state.rs:270

The metric is recorded with no attributes, making it difficult to debug issues for specific gateways or runners.

Recommendation: Consider adding attributes for better observability.


5. Protocol Version Compatibility (Medium Priority)

The protocol change from ToRunnerKeepAlive to ToRunnerPing breaks backward compatibility.

Recommendation: Document the minimum required protocol version for this feature and consider adding compatibility checks.


6. Inconsistent Import Ordering (Style)

Location: engine/packages/pegboard-gateway/src/shared_state.rs:11

Per CLAUDE.md conventions, imports should be at the top of the file.


Test Coverage

Missing test scenarios:

  • Ping timeout detection (connection goes stale for >30s)
  • Ping task abort when other tasks fail
  • Concurrent ping/pong with high message volume
  • Clock skew handling
  • Metrics recording under various RTT conditions

Recommendation: Add integration tests to cover these scenarios.


Security Considerations

No security concerns identified. The ping-pong mechanism doesn't introduce any new attack surfaces.


Performance Considerations

  • Ping Frequency: 3-second intervals seem reasonable for tunnel health checks
  • Timeout Duration: 30 seconds is conservative but appropriate for production
  • Metrics Overhead: Minimal

Summary

This is a solid implementation that adds necessary tunnel health monitoring. The main concerns are:

  1. High Priority: Review the ping timeout check logic to avoid race conditions
  2. Medium Priority: Send first ping immediately to detect dead connections faster
  3. Medium Priority: Document protocol version compatibility requirements

The code is well-structured and follows the project's conventions. With the above issues addressed, this will be a valuable addition to the codebase.

Great work overall!

@MasterPtato MasterPtato force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from 73e19ae to 1340648 Compare November 19, 2025 21:54
@MasterPtato MasterPtato force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from e47241f to 6521b98 Compare November 19, 2025 21:54
@MasterPtato MasterPtato force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from 1340648 to 633be91 Compare November 19, 2025 21:55
@claude

claude bot commented Nov 19, 2025

Code Review: Ping/Pong Implementation for Tunnels

Summary

This PR implements a ping/pong mechanism to monitor tunnel health between the gateway and pegboard runners. The implementation is well-structured with good separation of concerns through dedicated task modules. Overall, this is solid work with good architectural decisions.


✅ Strengths

  1. Good code organization - Splitting the monolithic lib.rs into separate task modules (ping_task.rs, tunnel_to_ws_task.rs, ws_to_tunnel_task.rs) significantly improves readability and maintainability.

  2. Proper metrics tracking - The RTT measurement and OpenTelemetry integration in metrics.rs provides good observability.

  3. Consistent protocol design - The ping/pong protocol additions follow the existing patterns with ToRunnerPing and ToGatewayPong types.

  4. Task coordination - The abort mechanism with watch channels properly coordinates the three concurrent tasks.


🐛 Issues & Concerns

High Priority

1. Potential Race Condition in last_pong Update (shared_state.rs:119, shared_state.rs:267)

In start_in_flight_request(), you initialize last_pong when creating a new request, but when re-using an existing Entry::Occupied, you don't reset last_pong. This means if a request is being restarted/reconnected, it could immediately timeout if the old last_pong timestamp is too old.

Entry::Occupied(mut entry) => {
    entry.receiver_subject = receiver_subject;
    entry.msg_tx = msg_tx;
    entry.drop_tx = drop_tx;
    entry.opened = false;
    entry.message_index = 0;
    // Missing: entry.last_pong = util::timestamp::now();

Recommendation: Reset last_pong in the Occupied branch as well.

2. Ping Task Error Handling Could Cause Silent Failures (ping_task.rs:21)

The ping task calls send_and_check_ping() which can return an error (including timeout errors). If this happens, the error propagates and terminates the ping task, which then terminates all other tasks. However, this might be too aggressive - a transient network issue could kill an otherwise healthy connection.

shared_state.send_and_check_ping(request_id).await?;

Recommendation: Consider whether timeout detection should be handled more gracefully, perhaps with retry logic or exponential backoff before terminating the connection.
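
A sketch of the tolerance idea, with illustrative names and a hypothetical failure budget:

use std::time::Duration;

const MAX_CONSECUTIVE_PING_FAILURES: u32 = 3;

async fn ping_loop_with_tolerance() -> anyhow::Result<()> {
    let mut failures = 0u32;
    loop {
        tokio::time::sleep(Duration::from_secs(3)).await;
        match send_and_check_ping().await {
            Ok(()) => failures = 0,
            Err(err) => {
                failures += 1;
                tracing::warn!(?err, failures, "ping attempt failed");
                // Only give up after several consecutive failures, so one
                // transient error does not kill a healthy connection.
                if failures >= MAX_CONSECUTIVE_PING_FAILURES {
                    return Err(err);
                }
            }
        }
    }
}

// Placeholder for shared_state.send_and_check_ping(request_id).
async fn send_and_check_ping() -> anyhow::Result<()> { Ok(()) }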

3. Missing RTT Tracking on Runner Side (tunnel_to_ws_task.rs:57-80)

The runner responds to pings but doesn't track its own RTT or connection health. This means the runner has no way to detect if the gateway has gone away or if the connection is degraded.

Recommendation: Consider implementing symmetric ping/pong where both sides track RTT and can detect connection issues.

Medium Priority

4. Hardcoded Timeout Constants Need Documentation

const TUNNEL_PING_TIMEOUT: i64 = util::duration::seconds(30);
const UPDATE_PING_INTERVAL: Duration = Duration::from_secs(3);

These constants define critical behavior but lack documentation explaining:

  • Why 30 seconds was chosen
  • What the expected RTT range is
  • How this relates to UPDATE_PING_INTERVAL (3 seconds → ~10 missed pings before timeout)

Recommendation: Add doc comments explaining the rationale and relationship between these values.

5. Inconsistent Timeout Type (i64 vs Duration) (shared_state.rs:19)

TUNNEL_PING_TIMEOUT uses i64 (milliseconds), while UPDATE_PING_INTERVAL uses Duration. This inconsistency makes it harder to understand the relationship between them.

Recommendation: Use Duration consistently and convert to millis only when needed for timestamp comparisons.
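
A sketch of the suggestion, using the values from the quoted constants (the comparison site is illustrative):

use std::time::Duration;

const UPDATE_PING_INTERVAL: Duration = Duration::from_secs(3);
const TUNNEL_PING_TIMEOUT: Duration = Duration::from_secs(30);

// Convert to milliseconds only at the timestamp comparison.
fn is_timed_out(now_ms: i64, last_pong_ms: i64) -> bool {
    now_ms.saturating_sub(last_pong_ms) > TUNNEL_PING_TIMEOUT.as_millis() as i64
}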

6. Metrics Missing Attributes (metrics.rs:9)

The comment says "Has no expected attributes" but adding attributes like gateway_id or aggregated connection state would make debugging much easier.

Recommendation: Consider adding optional attributes to correlate metrics with specific gateways or connection states.

7. GC Debug Messages Use #[allow(dead_code)] (shared_state.rs:444-447)

WebSocketMessageNotAcked {
    #[allow(dead_code)]
    first_msg_index: u16,
    #[allow(dead_code)]
    last_msg_index: u16,
},

These fields are only used in debug output but marked as dead code. This suggests the debug formatting might not be as useful as intended.

Recommendation: Either use these in structured logging or remove them if not providing value.


🔍 Minor Issues

8. Code Style: Import Ordering (lib.rs:12-13)

use rivet_guard_core::{
    WebSocketHandle,
    custom_serve::{CustomServeTrait, HibernationResult},

According to CLAUDE.md, imports should follow a consistent pattern. The WebSocketHandle import should be grouped with other imports alphabetically.

9. Commented Out Dead Code (shared_state.rs:543-545)

// fn wrapping_lt(a: u16, b: u16) -> bool {
//     b.wrapping_sub(a) < u16::MAX / 2
// }

Per CLAUDE.md: "Avoid backwards-compatibility hacks like... adding // removed comments for removed code, etc. If something is unused, delete it completely."

Recommendation: Remove the commented code.

10. Pattern Match Result Type Collision (tunnel_to_ws_task.rs:48)

let msg = match versioned::ToRunner::deserialize_with_embedded_version(&ups_msg.payload) {
    Result::Ok(x) => x,

Using Result::Ok instead of Ok suggests a potential type collision. This works but is unusual.

Recommendation: Verify if this is necessary or if standard Ok(x) would work.


✨ Suggestions

  1. Add integration tests - This ping/pong mechanism is critical for connection health. Consider adding tests that simulate:

    • Normal ping/pong exchange
    • Missed pings leading to timeout
    • Recovery from transient failures
  2. Consider backpressure - If the runner is overloaded and slow to respond to pings, should the gateway handle this differently than a dead connection?

  3. Logging improvements - Consider adding structured logging for ping RTT at regular intervals (not just in metrics) to aid debugging.


📊 Test Coverage

The PR doesn't include tests for the ping/pong mechanism. Given this is a critical reliability feature, I'd strongly recommend adding:

  • Unit tests for timeout detection logic
  • Integration tests for the full ping/pong cycle
  • Tests for edge cases (connection interruption, delayed pongs, etc.)

Security & Performance

No security concerns identified
Performance looks good - Ping interval of 3 seconds is reasonable and shouldn't cause overhead


Conclusion

This is a solid implementation that significantly improves tunnel reliability. The main concerns are:

  1. The race condition in last_pong initialization (should be fixed)
  2. Error handling strategy in the ping task (consider retry logic)
  3. Missing test coverage

Once the last_pong race condition is addressed, this would be ready to merge. Great work on the refactoring and separation of concerns!

@claude

claude bot commented Nov 19, 2025

PR Review: fix(tunnel): implement ping pong

Summary

This PR implements a ping/pong mechanism for tunnel connections between the gateway and pegboard runner to detect connection failures and measure round-trip time (RTT). The implementation adds active health monitoring to replace the previous passive keep-alive approach.


✅ Strengths

1. Good Code Organization

  • Excellent refactoring that extracts tunnel_to_ws_task, ws_to_tunnel_task, and ping_task into separate modules, improving code readability and maintainability
  • Clear separation of concerns between the three concurrent tasks

2. Proper Concurrency Handling

  • Well-structured use of tokio::select! with appropriate abort channels
  • Correct handling of task lifecycle with LifecycleResult::Aborted pattern
  • Good use of watch channels for coordinating task shutdown

3. Observability

  • Added metrics for tunnel ping duration (TUNNEL_PING_DURATION) which is valuable for monitoring
  • Comprehensive tracing with structured logging using request IDs

4. Protocol Design

  • Clean protocol extension with ToRunnerPing and ToGatewayPong messages
  • Preserves timestamp from ping in pong for accurate RTT calculation

⚠️ Issues & Concerns

1. Potential Race Condition in Ping Timeout (Medium Severity)

In shared_state.rs:210-221, the ping task sends a ping but checks last_pong before sending:

let now = util::timestamp::now();

// Verify ping timeout
if now.saturating_sub(req.last_pong) > TUNNEL_PING_TIMEOUT {
    tracing::warn!("tunnel timeout");
    return Err(WebSocketServiceTimeout.build());
}

// Send message
let message = protocol::ToRunner::ToRunnerPing(...)

Issue: If the connection has been dead for exactly 30 seconds, this will send one final ping before timing out. This is wasteful and could mask the actual timeout condition.

Recommendation: Check the timeout after attempting to send, or skip sending if we're about to timeout.


2. Missing last_pong Initialization Validation (Low-Medium Severity)

In shared_state.rs:119, new in-flight requests initialize last_pong: util::timestamp::now(). However, there's no validation that the first pong is received within a reasonable timeframe.

Scenario: A connection could be established, but if the first ping/pong exchange fails, it would take 30 seconds to detect the failure (from TUNNEL_PING_TIMEOUT).

Recommendation: Consider adding an initial handshake validation or reducing the effective timeout for the first ping/pong exchange.


3. Integer Overflow in Timestamp Arithmetic (Low Severity)

shared_state.rs:260 uses saturating_sub correctly:

let rtt = now.saturating_sub(pong.ts);

However, this could theoretically return 0 if clocks are skewed. While unlikely in practice with monotonic timestamps, consider:

  • Adding validation that rtt >= 0 and logging warnings for negative RTT
  • Using a more robust time source or documenting clock sync requirements

4. Inconsistent Error Handling in update_runner_ping (Low Severity)

In pegboard-runner/src/ping_task.rs:33-35, if the workflow doesn't exist, the code logs an error but returns Ok(()):

else {
    tracing::error!(?conn.runner_id, "workflow does not exist");
    return Ok(());
};

Question: Should this be a fatal error that stops the ping task, or is continuing silently the intended behavior? The current approach could hide genuine issues.

Recommendation: Consider returning an error or at least using tracing::warn! instead of tracing::error! if this is expected behavior.


5. Task Coordination Logic Could Be Simplified (Code Quality)

In lib.rs:426-448, the abort logic is duplicated with multiple abort channel clones:

let tunnel_to_ws_abort_tx2 = tunnel_to_ws_abort_tx.clone();
let ws_to_tunnel_abort_tx2 = ws_to_tunnel_abort_tx.clone();
let ping_abort_tx2 = ping_abort_tx.clone();

Recommendation: Consider using a shared abort token or a single broadcast channel to simplify the coordination logic. This would make the shutdown logic more maintainable.
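
A sketch of the single-token approach, assuming the tokio-util crate is available (task bodies are illustrative, not the actual gateway code):

use tokio_util::sync::CancellationToken;

async fn run_tasks() {
    let shutdown = CancellationToken::new();

    let t1 = tokio::spawn(task("tunnel_to_ws", shutdown.clone()));
    let t2 = tokio::spawn(task("ws_to_tunnel", shutdown.clone()));
    let t3 = tokio::spawn(task("ping", shutdown.clone()));

    let _ = tokio::join!(t1, t2, t3);
}

async fn task(name: &'static str, shutdown: CancellationToken) {
    tokio::select! {
        // Any task that finishes calls cancel(), stopping all siblings;
        // no per-task abort channels or *_tx2 clones are needed.
        _ = shutdown.cancelled() => tracing::debug!(%name, "aborted"),
        _ = do_work() => shutdown.cancel(),
    }
}

async fn do_work() { /* real task body */ }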


6. Missing Tracing Instrumentation (Minor)

The new task functions in tunnel_to_ws_task.rs and ws_to_tunnel_task.rs lack #[tracing::instrument] attributes that are present elsewhere in the codebase (e.g., ping_task.rs:9).

Recommendation: Add consistent instrumentation:

#[tracing::instrument(skip_all, fields(request_id=?tunnel_id::request_id_to_string(&request_id)))]
pub async fn task(...)

7. Import Organization (Minor - Style)

guard-core/src/custom_serve.rs:6 shows an import being moved:

+use pegboard::tunnel::id::RequestId;
 
 use crate::WebSocketHandle;
 use crate::proxy_service::ResponseBody;
 use crate::request_context::RequestContext;
 
-use pegboard::tunnel::id::RequestId;

According to CLAUDE.md: "Always add imports at the top of the file instead of inline within the function." This change is correct, but ensure consistency across all modified files.


🧪 Testing Concerns

1. Missing Test Coverage

There are no new test files added for:

  • Ping/pong message exchange
  • Timeout behavior when pongs stop arriving
  • RTT calculation accuracy
  • Task coordination and abort scenarios

Recommendation: Add integration tests covering:

- Happy path: ping sent, pong received, RTT recorded
- Timeout path: no pong received within TUNNEL_PING_TIMEOUT
- Task cleanup: all tasks properly abort on connection close

2. No Validation of Metrics

The new TUNNEL_PING_DURATION metric is recorded but there's no test validating:

  • Metric is actually recorded
  • RTT values are reasonable (not negative, not absurdly large)

🔒 Security Considerations

1. No Major Security Issues Detected

The ping/pong implementation doesn't introduce obvious security vulnerabilities. However:

  • Timestamp Trust: The gateway trusts the timestamp from the runner's pong. In a malicious scenario, a runner could send fake timestamps to manipulate RTT metrics. This is likely acceptable given the trust model.

  • DoS via Ping Flooding: The UPDATE_PING_INTERVAL (3 seconds) is reasonable and won't cause DoS, but ensure rate limiting exists at the protocol level if not already present.


📊 Performance Considerations

1. Reasonable Ping Interval

UPDATE_PING_INTERVAL = 3 seconds is a good balance between responsiveness and overhead.

2. Metrics Collection

The histogram collection (metrics::TUNNEL_PING_DURATION) has minimal overhead, using pre-defined buckets.

3. Lock Contention

shared_state.rs uses SCC's HashMap which is lock-free, so the additional last_pong field won't introduce contention.


📝 Recommendations Summary

High Priority:

  1. Fix the race condition in ping timeout checking

Medium Priority:

  2. Add validation for initial ping/pong handshake
  3. Add integration tests for ping/pong behavior
  4. Clarify error handling in update_runner_ping

Low Priority:

  5. Add tracing instrumentation to task functions
  6. Consider simplifying task abort coordination
  7. Add validation for RTT calculation edge cases


✅ Conclusion

This is a solid implementation of ping/pong for tunnel health monitoring. The code is well-structured and follows good Rust practices. The main concerns are around edge case handling and test coverage. With the recommended fixes, this will be a robust addition to the codebase.

Recommended Action: Address the high/medium priority items before merging, especially the ping timeout race condition and test coverage.

Great work on the refactoring! The separation into task modules makes the code much more maintainable. 🚀

@claude

claude bot commented Nov 19, 2025

Code Review: Ping-Pong Implementation for Tunnel

Summary

This PR implements a WebSocket ping-pong mechanism between the pegboard gateway and runner to detect stale connections and measure round-trip time (RTT). The implementation adds proper connection health monitoring to the tunnel protocol.


✅ Strengths

  1. Good Architecture: The refactoring of monolithic async tasks into separate modules (ping_task.rs, tunnel_to_ws_task.rs, ws_to_tunnel_task.rs) significantly improves code organization and maintainability.

  2. Proper Protocol Evolution: The protocol changes from ToRunnerKeepAlive to ToRunnerPing/ToGatewayPong with timestamps enable RTT measurement, which is more useful than a simple keepalive.

  3. Metrics Integration: Adding TUNNEL_PING_DURATION histogram provides valuable observability into tunnel health (gateway/src/metrics.rs:10-13).

  4. Consistent Timeout Handling: The TUNNEL_PING_TIMEOUT of 30 seconds with UPDATE_PING_INTERVAL of 3 seconds provides a reasonable 10x safety margin.

  5. Good Error Recovery: The ping task refactoring in runner (runner/src/ping_task.rs:27-66) properly extracts update_runner_ping into a separate function with better error handling.


🔍 Issues & Concerns

1. Potential Ordering Issue in Abort Logic ⚠️

Location: engine/packages/pegboard-gateway/src/lib.rs:435-436, 450-451, 465-466

The abort signals are sent using .send() which can fail, but errors are ignored:

let _ = ping_abort_tx.send(());
let _ = ws_to_tunnel_abort_tx.send(());

Issue: If these sends fail (e.g., receiver already dropped), the tasks won't be properly aborted. This could lead to resource leaks or zombie tasks.

Recommendation:

  • Consider logging when abort signals fail to send
  • Or verify that ignoring failures is intentional (e.g., receiver already completed)

2. Timestamp Overflow Potential ⚠️

Location: engine/packages/pegboard-gateway/src/shared_state.rs:218

if now.saturating_sub(req.last_pong) > TUNNEL_PING_TIMEOUT {

The comparison uses saturating_sub, which is good, but TUNNEL_PING_TIMEOUT is defined as i64 milliseconds (30,000ms). If now or req.last_pong are corrupted or uninitialized, this could fail silently.

Recommendation: Add validation that req.last_pong is reasonable (e.g., not zero, not far in the future).

3. Race Condition in Ping Update ⚠️

Location: engine/packages/pegboard-gateway/src/shared_state.rs:265-267

let now = util::timestamp::now();
in_flight.last_pong = now;

The last_pong timestamp is updated in the pong handler, but there's no synchronization with the ping checker. If a ping timeout check happens between receiving the pong message and updating last_pong, it could trigger a false timeout.

Recommendation: Consider atomic operations or ensure the update happens before any timeout checks can observe the old value.

4. Missing Error Context

Location: engine/packages/pegboard-gateway/src/shared_state.rs:215

.context("request not in flight")?;

This error message doesn't include the request_id, making debugging difficult.

Recommendation:

.with_context(|| format!("request not in flight: {:?}", tunnel_id::request_id_to_string(&request_id)))?;

5. Incomplete Close Frame Handling

Location: engine/packages/pegboard-gateway/src/lib.rs:476-486

The logic to determine the final lifecycle_res from three tasks prefers the first non-aborted result:

(Ok(res), Ok(LifecycleResult::Aborted), _) => Ok(res),
(Ok(LifecycleResult::Aborted), Ok(res), _) => Ok(res),
// Unlikely case
(res, _, _) => res,

Issue: If the ping task times out (returns error) while the other two tasks complete successfully, the error takes precedence. This might close the connection even if data transfer was successful.

Recommendation: Consider whether ping timeout should take precedence over successful close from other tasks, or if it should only abort if both other tasks are still running.

6. Protocol Breaking Change ⚠️

Location: engine/sdks/schemas/runner-protocol/v3.bare

The protocol changed from:

type ToRunnerKeepAlive void

to:

type ToRunnerPing struct {
    gatewayId: GatewayId
    requestId: RequestId
    ts: i64
}

Issue: This is a breaking protocol change. Old runners won't understand the new ping format.

Question: Is there a protocol version negotiation? Should this be v4 instead of modifying v3?


🎯 Security Considerations

  1. DoS via Timestamp Manipulation: If an attacker could manipulate timestamps, they could cause false timeouts. The implementation uses util::timestamp::now() which should be monotonic, but verify it's using a trusted clock source.

  2. Resource Exhaustion: The ping task runs every 3 seconds for each connection. For many connections, this could generate significant load. Consider:

    • Rate limiting ping operations
    • Batching ping updates
    • Monitoring ping task count

📊 Performance Considerations

  1. Metrics Overhead: Each ping generates a histogram sample. For high-connection-count scenarios, verify this doesn't impact performance (gateway/src/metrics.rs:10-13).

  2. GC Impact: The TUNNEL_PING_TIMEOUT affects GC behavior in shared_state.rs:482. Document the relationship between ping timeout and GC cycles.

  3. Lock Contention: in_flight_requests is accessed on every ping/pong. For high throughput, consider if this becomes a bottleneck (shared_state.rs:216).


🧪 Test Coverage

Critical Issue: No tests were added for the new ping-pong functionality.

Recommendations:

  • Add unit tests for send_and_check_ping timeout logic
  • Add integration tests for ping-pong flow
  • Test race conditions (concurrent ping timeout and pong receipt)
  • Test protocol serialization/deserialization for new message types
  • Test abort signal handling in all three tasks

📝 Documentation

Missing:

  • No documentation on the ping-pong protocol behavior
  • No comments explaining timeout values (why 3s and 30s?)
  • The relationship between ping timeout and hibernation needs documentation

Recommendation: Add module-level docs explaining:

  • The purpose of ping-pong (connection health, RTT measurement)
  • Timeout values and their rationale
  • How ping interacts with hibernation

🔧 Code Style

Good:

  • Follows Rust conventions
  • Proper use of structured logging with tracing
  • Good error propagation with anyhow

Minor Issues:

  1. Unused #[allow(dead_code)] attributes in shared_state.rs:444-445 suggest incomplete implementation or unnecessary fields
  2. Comment in runner protocol (v3.bare:18) says "TODO: Rename to update_ts" - should this be addressed?

💡 Suggestions

  1. Add Exponential Backoff: If ping fails transiently, exponential backoff could reduce unnecessary connection drops.

  2. Configurable Timeouts: Make UPDATE_PING_INTERVAL and TUNNEL_PING_TIMEOUT configurable rather than hardcoded constants.

  3. Ping Jitter: Add small random jitter to ping intervals to avoid thundering herd with many connections.

  4. Connection State Machine: Consider formalizing the connection states (CONNECTED, PINGING, TIMEOUT, etc.) for better debugging.


🎬 Conclusion

This is a solid implementation that adds necessary connection health monitoring. The code refactoring improves maintainability significantly. However, there are several important issues to address before merging:

Must Fix:

  • Clarify protocol version compatibility (breaking change concern)
  • Add tests for ping-pong functionality
  • Review abort signal error handling

Should Fix:

  • Add error context with request IDs
  • Document timeout values and behavior
  • Consider race condition in last_pong update

Nice to Have:

  • Add configurability for timeouts
  • Add ping jitter
  • Improve metrics documentation

Overall: Approve with changes - The implementation is sound but needs test coverage and protocol version clarification before merging.

@claude

claude bot commented Nov 20, 2025

Code Review: Tunnel Ping-Pong Implementation

I've reviewed this PR implementing ping-pong functionality for tunnel connections. Overall, the implementation is solid with good architectural decisions.

Strengths

  1. Well-structured refactoring into separate task modules (ping_task.rs, tunnel_to_ws_task.rs, ws_to_tunnel_task.rs)
  2. Proper protocol versioning with backwards compatibility in v3.bare schema
  3. Good timeout handling: 30-second ping timeout with 3-second interval
  4. Metrics integration: TUNNEL_PING_DURATION histogram for RTT tracking
  5. Clean task lifecycle management with abort channels

Critical Issues

  1. Potential race condition (shared_state.rs:221): send_and_check_ping checks last_pong for timeout, but it's initialized to now() at request start. If first ping happens before runner fully connects, could cause false timeout. Consider only starting ping task after tunnel fully established.

  2. Inconsistent error handling (lib.rs:420-430): Result handling has unclear fallback case. The (res, _, _) => res only returns tunnel_to_ws, ignoring other task results. Need explicit handling or documentation.

Medium Priority

  1. Logging inconsistency (shared_state.rs:261): Says "dropping ping" but handling ToGatewayPong - should say "dropping pong"

  2. Missing tracing span: ping_task.rs lacks .instrument() span unlike other tasks

  3. Incorrect allow(dead_code) (shared_state.rs:443-448): Fields are used in debug output, shouldn't be marked dead

Testing Concerns

  • No test additions for significant protocol change
  • Need integration tests for ping timeout scenarios
  • Should test hibernation interaction with ping task

Summary

Well-implemented feature with solid architecture. Address critical race condition and error handling issues before merge.

Recommendation: Changes requested for items 1-2, others are minor improvements.

@MasterPtato MasterPtato force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from e98d4d0 to 9d2a169 Compare November 20, 2025 22:38
@MasterPtato MasterPtato force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from d287067 to 40a9f30 Compare November 20, 2025 22:38
@claude

claude bot commented Nov 20, 2025

Pull Request Review: Implement Ping/Pong for Tunnel

Overview

This PR implements a ping/pong heartbeat mechanism between the gateway and pegboard runner to detect connection failures and measure RTT. The implementation is well-structured with dedicated tasks for ping management and proper protocol extensions.

✅ Strengths

1. Clean Architecture

  • Excellent refactoring of websocket handling logic into separate task modules (tunnel_to_ws_task.rs, ws_to_tunnel_task.rs, ping_task.rs)
  • Consistent task structure across both gateway and runner implementations
  • Good separation of concerns with dedicated ping tasks

2. Protocol Design

  • Proper ping/pong message types added to the protocol schema (ToRunnerPing, ToGatewayPong)
  • Backward compatible changes to the versioned protocol
  • RTT tracking through timestamp passing in ping/pong messages

3. Monitoring & Observability

  • New metrics module with TUNNEL_PING_DURATION histogram for tracking RTT
  • Good use of structured logging with tracing throughout
  • Proper instrumentation on key functions

4. Timeout Handling

  • Well-defined timeout constant: TUNNEL_PING_TIMEOUT = 30s
  • Ping interval: UPDATE_PING_INTERVAL = 3s (appropriate for 30s timeout)
  • Proper timeout detection in send_and_check_ping() (shared_state.rs:221)

⚠️ Issues & Concerns

1. Potential Race Condition in Ping Timeout Check
engine/packages/pegboard-gateway/src/shared_state.rs:211-243

The send_and_check_ping function checks the timeout and then sends a ping, but there's a potential race:

// Line 221: Check timeout
if now.saturating_sub(req.last_pong) > TUNNEL_PING_TIMEOUT {
    tracing::warn!("tunnel timeout");
    return Err(WebSocketServiceTimeout.build());
}

// Line 227-241: Send ping (after the check)

Issue: If a pong is received between the timeout check (line 221) and the next ping send, last_pong is updated but we've already passed the timeout check. On the next iteration, if the pong is slightly delayed, we could false-positive timeout.

Recommendation: Consider checking last_pong immediately before the timeout check is made, or use an atomic operation to ensure consistency. Alternatively, document this as acceptable behavior since the window is small.

2. Silent Ping/Pong Handling
engine/packages/pegboard-runner/src/tunnel_to_ws_task.rs:57-80

The runner receives pings and sends pongs through UPS, but there's no logging at debug level for successful ping/pong exchanges. While trace logging exists for the message reception, explicit ping/pong success isn't logged.

Recommendation: Add debug-level logging when pongs are sent to aid in troubleshooting connection issues:

tracing::debug!(?ping.request_id, "received ping, sending pong");

3. Missing Atomic RTT Update
engine/packages/pegboard-runner/src/tunnel_to_ws_task.rs:57-77

The pong handler doesn't update conn.last_rtt, but the ping task reads from it:

// ping_task.rs:43
let rtt = conn.last_rtt.load(Ordering::Relaxed);

Critical Issue: The RTT is never updated from the ping/pong mechanism! The old ping mechanism might have populated this, but with the new implementation, last_rtt appears stale.

Recommendation: Calculate and update RTT in the pong handler:

let rtt = util::timestamp::now().saturating_sub(ping.ts);
conn.last_rtt.store(rtt, Ordering::Relaxed);

4. Inconsistent Logging Style
engine/packages/pegboard-gateway/src/shared_state.rs:261

"in flight has already been disconnected, dropping ping"

The log says "dropping ping" but this is actually handling a pong. Should be "dropping pong" for consistency.

5. GC Reason Struct with Dead Code Attributes
engine/packages/pegboard-gateway/src/shared_state.rs:443-448

WebSocketMessageNotAcked {
    #[allow(dead_code)]
    first_msg_index: u16,
    #[allow(dead_code)]
    last_msg_index: u16,
},

Recommendation: These fields are useful for debugging. Either remove the #[allow(dead_code)] and actually use them in logging, or remove the fields if truly not needed.

6. Unclear Task Result Handling
engine/packages/pegboard-gateway/src/lib.rs:418-430

The logic for determining the final result from 3 tasks is complex:

let mut lifecycle_res = match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res) {
    (Err(err), _, _) => Err(err),
    (_, Err(err), _) => Err(err),
    (_, _, Err(err)) => Err(err),
    (Ok(res), Ok(LifecycleResult::Aborted), _) => Ok(res),
    (Ok(LifecycleResult::Aborted), Ok(res), _) => Ok(res),
    (res, _, _) => res,  // "Unlikely case"
};

Issue: The "unlikely case" comment suggests this is unexpected, but it's not clear what happens if the ping task returns a non-aborted success (which should never happen in practice since ping loops forever).

Recommendation: Add explicit handling or an assertion for impossible states to make the invariants clear.
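
A sketch of that change, rewriting the quoted match (LifecycleResult and the tuple shape are taken from the snippet above):

let mut lifecycle_res = match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res) {
    (Err(err), _, _) | (_, Err(err), _) | (_, _, Err(err)) => Err(err),
    (Ok(res), Ok(LifecycleResult::Aborted), _) => Ok(res),
    (Ok(LifecycleResult::Aborted), Ok(res), _) => Ok(res),
    (Ok(res), Ok(_), _) => {
        // The ping task loops until aborted or failed, so it should never
        // yield a non-aborted success; surface the broken invariant in
        // debug builds instead of silently preferring one result.
        debug_assert!(false, "unexpected task result combination");
        Ok(res)
    }
};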

7. Missing Error Context
engine/packages/pegboard-gateway/src/ping_task.rs:21

shared_state.send_and_check_ping(request_id).await?;

If this errors, the context isn't clear that it's from the ping task.

Recommendation: Add context:

shared_state.send_and_check_ping(request_id).await
    .context("ping task failed to send/check ping")?;

📋 Minor Issues

1. Unused Import Warning Potential
The protocol schema shows deprecated ToRunnerKeepAlive (line 411 in schema), but the code has been updated to ToRunnerPing. Ensure the deprecated type is eventually removed once old clients are updated.

2. Code Style - Import Ordering
engine/packages/guard-core/src/custom_serve.rs:4-7

The imports are reordered with pegboard import moved. While functional, consider whether this was intentional or a side effect of auto-formatting.

3. Metric Units
engine/packages/pegboard-gateway/src/shared_state.rs:270

metrics::TUNNEL_PING_DURATION.record(rtt as f64 * 0.001, &[]);

The conversion from milliseconds to seconds is correct, but adding a comment would help:

// Record RTT in seconds (convert from milliseconds)
metrics::TUNNEL_PING_DURATION.record(rtt as f64 * 0.001, &[]);

🔒 Security Considerations

No significant security issues identified. The ping/pong mechanism:

  • Uses existing authenticated channels (UPS)
  • Doesn't introduce new attack surfaces
  • Proper timeout handling prevents resource exhaustion

🧪 Testing Recommendations

Consider adding tests for:

  1. Ping timeout triggering proper connection termination
  2. RTT calculation accuracy
  3. Behavior when pongs are lost but connection is alive
  4. Metrics recording correctness

📊 Performance Considerations

Positive:

  • Ping interval (3s) is reasonable and won't cause excessive overhead
  • Async task structure minimizes blocking
  • Metrics use histograms appropriately

Consider:

  • Monitor the metrics in production to ensure 3s ping interval is appropriate
  • The 30s timeout seems generous - could be tuned based on production data

✨ Summary

This is a solid implementation of a critical reliability feature. The main concern is the missing RTT update in the runner's pong handler (Issue #3), which should be addressed before merge. The other issues are minor but worth considering for improved debuggability and code clarity.

Recommendation: Request changes to address Issue #3 (missing RTT update), then approve with the other issues noted for follow-up improvements.

@claude

claude bot commented Nov 20, 2025

PR Review: Implement Ping/Pong for Tunnel Connection Health

This PR implements a ping/pong mechanism to maintain and monitor tunnel connection health between gateway and runner. Overall, this is a solid implementation with good code organization. Here's my detailed review:

✅ Strengths

1. Protocol Design

  • Clean protocol extension with ToRunnerPing and ToGatewayPong messages (v3.bare:402-428)
  • Proper timestamp-based RTT calculation
  • Good separation of concerns between runner-to-gateway control plane messaging

2. Code Organization

  • Excellent refactoring: extracted tunnel_to_ws_task.rs, ws_to_tunnel_task.rs, and ping_task.rs into separate modules
  • Makes the codebase more maintainable and testable
  • Clear separation of responsibilities

3. Metrics Implementation

  • Added TUNNEL_PING_DURATION histogram to track RTT (metrics.rs:10-13)
  • Proper use of OpenTelemetry standards
  • Converts milliseconds to seconds for metric recording (shared_state.rs:270)

4. Timeout Handling

  • Implements both ping interval (3s) and timeout (30s) constants
  • Proper use of saturating_sub to prevent underflow when calculating elapsed time (shared_state.rs:221)
  • Gateway verifies last pong timestamp before sending next ping

🔍 Issues & Suggestions

1. Potential Race Condition in Ping Timeout Check ⚠️
pegboard-gateway/src/shared_state.rs:221-223

if now.saturating_sub(req.last_pong) > TUNNEL_PING_TIMEOUT {
    tracing::warn!("tunnel timeout");
    return Err(WebSocketServiceTimeout.build());
}

Issue: The check happens before sending the ping. If the connection is slow but alive, this could timeout prematurely since last_pong might be stale from the previous interval.

Suggestion: Consider checking timeout only after a ping fails, or add a grace period. Alternatively, track "pings sent without response" counter.

2. Missing Error Context in Pong Handler
pegboard-gateway/src/shared_state.rs:261

The log message says "dropping ping" but should say "dropping pong":

"in flight has already been disconnected, dropping ping"

Should be: "in flight has already been disconnected, dropping pong"

3. Async Lock Contention in Runner Ping Task ℹ️
pegboard-runner/src/ping_task.rs:27-66

The update_runner_ping function is called every 3 seconds and performs multiple async operations (workflow fetch, op call). While generally fine, under high load this could cause ping intervals to drift.

Suggestion: Consider using tokio::time::MissedTickBehavior::Burst or adding telemetry to track drift.
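
A sketch using tokio's interval (the tick behaviors are from tokio's API; the loop body is illustrative):

use std::time::Duration;
use tokio::time::{MissedTickBehavior, interval};

async fn runner_ping_loop() {
    // The first tick completes immediately, so the first update is not
    // delayed by the interval.
    let mut ticker = interval(Duration::from_secs(3));
    // Burst fires missed ticks back-to-back to catch up; Delay re-anchors
    // the schedule after a slow iteration; Skip drops missed ticks.
    ticker.set_missed_tick_behavior(MissedTickBehavior::Burst);
    loop {
        ticker.tick().await;
        // update_runner_ping(...).await
    }
}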

4. Ping Task Cleanup Order 🤔
pegboard-gateway/src/lib.rs:425-472

The cleanup logic with three concurrent tasks (tunnel_to_ws, ws_to_tunnel, ping) uses nested watch channels for abortion. The current approach works but is somewhat complex.

Consider: The ping task timeout error will abort the other tasks, which is correct. However, documenting this flow would help future maintainers understand the intended behavior.

5. TypeScript Stringify Utility 👍
Great addition of stringify.ts for better logging! This will significantly help debugging. One minor note:

  • The file is comprehensive but quite large (336 lines)
  • Consider splitting into multiple files if it grows further

6. Constants Naming & Documentation 📝

const UPDATE_PING_INTERVAL: Duration = Duration::from_secs(3);
const TUNNEL_PING_TIMEOUT: i64 = util::duration::seconds(30);

Suggestion: Add documentation comments explaining the relationship between these values (timeout should be >> interval * some factor).
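
A sketch of the suggested documentation, attached to the quoted constants (wording is illustrative):

/// How often the gateway pings each tunnel to probe liveness and RTT.
const UPDATE_PING_INTERVAL: Duration = Duration::from_secs(3);

/// How long the gateway tolerates silence before declaring the tunnel
/// dead. Kept at 10x UPDATE_PING_INTERVAL so roughly ten consecutive
/// pings must go unanswered, which tolerates transient packet loss.
const TUNNEL_PING_TIMEOUT: i64 = util::duration::seconds(30);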

7. Test Coverage ⚠️
I don't see any new tests for the ping/pong mechanism. Given that this is critical for connection health:

Recommendation: Add tests covering:

  • Ping timeout triggers connection close
  • Pong updates last_pong timestamp
  • RTT metrics are recorded correctly
  • Race conditions with request cleanup

🔒 Security Considerations

✅ No security issues identified. The ping/pong mechanism:

  • Uses existing authentication/authorization
  • Doesn't expose new attack surface
  • Timestamps are server-controlled (no client manipulation)

🎯 Performance Considerations

Positive:

  • 3-second ping interval is reasonable for most use cases
  • Metrics recording is lightweight
  • Async tasks don't block main event loop

Potential Concern:

  • At scale with many concurrent tunnels, 3-second pings could generate significant pubsub traffic
  • Current: N tunnels * 1 ping/3s = N/3 messages/sec
  • Consider making ping interval configurable per deployment scale

📋 Minor Code Quality Notes

  1. Import ordering (guard-core/src/custom_serve.rs:4-7): The imports were reordered, which is good for consistency.

  2. Consistent error handling: Good use of anyhow::Result and custom error types throughout.

  3. Tracing instrumentation: Good use of structured logging with #[tracing::instrument] macros.

🎓 Recommendations

Must Fix:

  1. Fix the "dropping ping" → "dropping pong" log message
  2. Add test coverage for ping/pong mechanism

Should Consider:

  1. Review ping timeout logic to prevent premature timeouts
  2. Add documentation for ping interval/timeout relationship
  3. Add metrics for ping failures/timeouts

Nice to Have:

  1. Make ping interval configurable
  2. Add drift tracking for ping tasks under load
  3. Document task cleanup flow in comments

Overall Assessment

This is a well-implemented feature that adds important connection health monitoring to the tunnel system. The code is clean, well-organized, and follows project conventions. With a few minor fixes and test coverage, this will be production-ready.

Approval: ✅ Approved with minor suggestions

Great work on the refactoring and clean separation of concerns! The new module structure is much more maintainable.

@NathanFlurry NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from 9d2a169 to e98d4d0 Compare November 20, 2025 23:32
@NathanFlurry NathanFlurry force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from 40a9f30 to d287067 Compare November 20, 2025 23:32
@claude

claude bot commented Nov 20, 2025

PR Review: Implement Ping-Pong for Tunnel Connections

Overview

This PR implements ping-pong heartbeat functionality between the gateway and runner for tunnel connections. The implementation adds proactive health checking to detect and handle tunnel timeouts more effectively.


✅ Strengths

1. Good Architectural Separation

  • Excellent refactoring of lib.rs by extracting tasks into dedicated modules (ping_task.rs, tunnel_to_ws_task.rs, ws_to_tunnel_task.rs)
  • Clean separation of concerns makes the code more maintainable
  • Consistent pattern between gateway and runner implementations

2. Proper Error Handling

  • Timeout detection is implemented correctly with TUNNEL_PING_TIMEOUT (30s)
  • Proper use of saturating_sub to prevent integer underflow
  • Graceful handling of missing in-flight requests

3. Observability

  • Added metrics tracking (TUNNEL_PING_DURATION) for RTT monitoring
  • Good use of structured logging with tracing::instrument
  • RTT metric converted to seconds (rtt as f64 * 0.001)

4. Protocol Design

  • Protocol changes in v3.bare are minimal and backward-compatible in structure
  • Ping/pong messages include timestamps for RTT calculation
  • Request ID included for proper message routing

🔍 Issues & Concerns

Critical: Potential Race Condition

Location: engine/packages/pegboard-gateway/src/shared_state.rs:221-224

// Verify ping timeout
if now.saturating_sub(req.last_pong) > TUNNEL_PING_TIMEOUT {
    tracing::warn!("tunnel timeout");
    return Err(WebSocketServiceTimeout.build());
}

Issue: The timeout check happens BEFORE sending the ping, but it compares against last_pong which is only updated when a pong is received. This means:

  • At connection start, last_pong is initialized to now (line 45 in start_in_flight_request), so the first check passes
  • Subsequent pings check if the PREVIOUS pong was received within timeout
  • If a single pong is lost/delayed, the connection is terminated even if the tunnel is healthy

Suggestion: Consider one of these approaches (a minimal sketch of option 2 follows the list):

  1. Track last_ping_sent separately and check if now - last_ping_sent > TIMEOUT
  2. Only fail after N consecutive missed pongs
  3. Document why this behavior is intentional if it is
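
A minimal sketch of option 2, with an illustrative threshold; the type and names are hypothetical:

const MAX_MISSED_PONGS: u32 = 3; // illustrative tolerance

#[derive(Default)]
struct PongTracker {
    missed: u32,
}

impl PongTracker {
    /// Call when a pong arrives; resets the failure window.
    fn on_pong(&mut self) {
        self.missed = 0;
    }

    /// Call for each ping interval that elapsed without a pong.
    /// Returns true once the connection should be treated as dead.
    fn on_missed_pong(&mut self) -> bool {
        self.missed += 1;
        self.missed >= MAX_MISSED_PONGS
    }
}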

Medium: Timer Placement Issue

Location: engine/packages/pegboard-gateway/src/ping_task.rs:13-21

loop {
    tokio::select! {
        _ = tokio::time::sleep(UPDATE_PING_INTERVAL) => {}
        _ = ping_abort_rx.changed() => {
            return Ok(LifecycleResult::Aborted);
        }
    }
    
    shared_state.send_and_check_ping(request_id).await?;
}

Issue: The sleep happens BEFORE the first ping is sent. This means:

  • Connection established at t=0
  • First ping sent at t=3s (after UPDATE_PING_INTERVAL)
  • If send_and_check_ping takes time or fails, the interval drifts

Suggestion: Move the ping before the sleep or use tokio::time::interval with MissedTickBehavior::Skip (like the runner's ping_task.rs:17 already does correctly).

Low: Inconsistent Error Handling

Location: Multiple places handle missing in-flight requests differently:

  • shared_state.rs:216: Returns error via .context("request not in flight")
  • shared_state.rs:257-263: Logs debug and continues
  • ping_task.rs:21: Propagates error from send_and_check_ping

Suggestion: Document whether missing in-flight requests during ping are expected or should terminate the connection.

Low: Message Index Wrapping

Location: engine/packages/pegboard-gateway/src/shared_state.rs:489-492

The wrapping_gt function is used for message index comparison with 16-bit indices. With the new ping messages being sent every 3 seconds, this adds ~20 messages/minute which is fine, but ensure high-traffic WebSocket messages don't cause issues with the wrapping logic.

Current state: Looks correct, but worth noting for future considerations.


🎯 Suggestions

1. Add Unit Tests

While integration tests exist, consider adding unit tests (a minimal sketch follows this list) for:

  • Timeout detection logic
  • Wrapping message index comparison
  • Pong timestamp update logic
  • Race conditions between ping timeout and pong receipt
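
For example, the timeout predicate could be extracted into a pure function and tested directly; this is a sketch, with the constant mirroring TUNNEL_PING_TIMEOUT (30s):

const TUNNEL_PING_TIMEOUT_MS: i64 = 30_000;

fn is_timed_out(now_ms: i64, last_pong_ms: i64) -> bool {
    now_ms.saturating_sub(last_pong_ms) > TUNNEL_PING_TIMEOUT_MS
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn no_timeout_within_window() {
        assert!(!is_timed_out(30_000, 0));
    }

    #[test]
    fn timeout_after_window() {
        assert!(is_timed_out(30_001, 0));
    }

    #[test]
    fn clock_skew_does_not_underflow() {
        // now behind last_pong: saturating_sub clamps to 0
        assert!(!is_timed_out(0, 30_000));
    }
}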

2. Configuration

Consider making these values configurable:

const UPDATE_PING_INTERVAL: Duration = Duration::from_secs(3);
const TUNNEL_PING_TIMEOUT: i64 = util::duration::seconds(30);

Different deployment environments may need different timeouts.

3. Metrics Enhancement

Consider adding:

  • Counter for ping timeouts
  • Gauge for current number of active tunnels
  • Histogram for time between pongs (to detect degradation)

4. Documentation

Add comments explaining:

  • Why TUNNEL_PING_TIMEOUT is 30s vs UPDATE_PING_INTERVAL is 3s (10x ratio)
  • The ping-pong flow between gateway and runner
  • When hibernation is triggered vs timeout

5. TypeScript Stringify Helper

The new stringify.ts file is well-structured, but it's purely for debugging. Consider:

  • Adding JSDoc comments about when to use these helpers
  • Documenting that this is for logging only (not serialization)

🔒 Security Considerations

No security issues identified

  • No injection vulnerabilities
  • Timestamps handled correctly
  • No sensitive data in ping/pong messages
  • Proper timeout enforcement

⚡ Performance Considerations

Acceptable overhead:

  • Ping every 3s is reasonable for most use cases
  • Metric recording is async and shouldn't block
  • Message serialization is minimal overhead

Potential optimization:

  • The scc::HashMap lookups for in-flight requests happen on every ping. For high-scale deployments with 10k+ concurrent connections, consider batching or optimizing lookups.

📊 Test Coverage

Concern: No new tests added specifically for ping-pong functionality.

Recommendation:
Add tests covering:

  1. Normal ping-pong flow
  2. Timeout detection when pong not received
  3. RTT calculation accuracy
  4. Behavior when gateway/runner disconnects during ping
  5. Message index wrapping edge cases

The minor change to driver-test-suite/utils.ts appears to be just adding the useRealTimers flag, which is good for testing timing-sensitive code.


📝 Summary

Verdict: Approve with minor concerns to address

This is a solid implementation of ping-pong heartbeats. The code quality is good, and the architectural changes improve maintainability. However, there are a few timing-related issues that should be addressed:

  1. Critical: Verify the timeout logic doesn't have a race condition
  2. Medium: Fix timer placement in gateway ping_task
  3. Low: Add tests for the new functionality

The changes follow the repository's conventions well (Rust formatting, error handling patterns, logging style).


🔧 Suggested Changes

Minimal changes to merge:

// In ping_task.rs, restructure the loop:
pub async fn task(...) -> Result<LifecycleResult> {
    let mut interval = tokio::time::interval(UPDATE_PING_INTERVAL);
    interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
    
    loop {
        tokio::select! {
            _ = interval.tick() => {}
            _ = ping_abort_rx.changed() => {
                return Ok(LifecycleResult::Aborted);
            }
        }
        
        shared_state.send_and_check_ping(request_id).await?;
    }
}

Recommended for follow-up:

  • Add comprehensive tests
  • Make timeouts configurable
  • Add additional metrics
  • Document the ping-pong protocol flow

Great work overall! 🚀

Generated by Claude Code

@claude

claude bot commented Nov 20, 2025

Pull Request Review: Tunnel Ping/Pong Implementation

Overview

This PR implements a ping/pong mechanism for the tunnel communication between the pegboard gateway and runners. This is a critical reliability feature that helps detect and handle stale connections.


✅ Strengths

Architecture & Design:

  • Clean separation of concerns with dedicated task modules (ping_task.rs, tunnel_to_ws_task.rs, ws_to_tunnel_task.rs)
  • Proper use of tokio::select! for concurrent task management
  • Well-structured protocol changes in the BARE schema

Observability:

  • Added metrics tracking for RTT (rivet_gateway_tunnel_ping_duration)
  • Good use of structured logging with tracing throughout
  • Proper use of #[tracing::instrument] attributes

Error Handling:

  • Timeout mechanism properly implemented (30s timeout)
  • Graceful handling of missing in-flight requests
  • Proper use of anyhow::Result throughout

🔍 Issues & Concerns

Critical: Potential Race Condition in Pong Handling

// engine/packages/pegboard-gateway/src/shared_state.rs:255-270
Ok(protocol::ToGateway::ToGatewayPong(pong)) => {
    let Some(mut in_flight) =
        self.in_flight_requests.get_async(&pong.request_id).await
    else {
        tracing::debug!(
            request_id=?tunnel_id::request_id_to_string(&pong.request_id),
            "in flight has already been disconnected, dropping ping"
        );
        continue;
    };

    let now = util::timestamp::now();
    in_flight.last_pong = now;  // ⚠️ Mutation on guard object

    let rtt = now.saturating_sub(pong.ts);
    metrics::TUNNEL_PING_DURATION.record(rtt as f64 * 0.001, &[]);
}

Issue: You're mutating in_flight.last_pong through the guard returned by get_async(). Depending on the scc::HashMap implementation, this mutation may not be visible to other tasks or may not persist after the guard is dropped.

Recommendation: Use update_async() or similar method to ensure the mutation is properly committed:

self.in_flight_requests
    .update_async(&pong.request_id, |_, req| {
        req.last_pong = util::timestamp::now();
    })
    .await;

Medium: Timeout Check Timing Issue

// engine/packages/pegboard-gateway/src/shared_state.rs:221-224
if now.saturating_sub(req.last_pong) > TUNNEL_PING_TIMEOUT {
    tracing::warn!("tunnel timeout");
    return Err(WebSocketServiceTimeout.build());
}

Issue: The timeout check happens BEFORE sending the ping. If a pong is delayed but arrives before the next ping cycle, you could timeout prematurely.

Recommendation: Consider checking timeout AFTER a failed pong receive or use a separate timeout task.

Medium: Metrics Unit Mismatch

// engine/packages/pegboard-gateway/src/shared_state.rs:270
metrics::TUNNEL_PING_DURATION.record(rtt as f64 * 0.001, &[]);

Issue: Converting milliseconds to seconds (* 0.001) but the metric description says "RTT of messages" without specifying units. The histogram boundaries (BUCKETS) are likely in seconds, but this should be explicit.

Recommendation: Update the metric description in metrics.rs:

.with_description("RTT of messages from gateway to pegboard in seconds.")

Low: Inconsistent Error Logging

// engine/packages/pegboard-gateway/src/shared_state.rs:222
tracing::warn!("tunnel timeout");

Issue: Missing structured fields (request_id, last_pong timestamp, etc.) that would help debugging.

Recommendation:

tracing::warn!(
    request_id=?tunnel_id::request_id_to_string(&request_id),
    last_pong=?req.last_pong,
    elapsed=?now.saturating_sub(req.last_pong),
    "tunnel timeout"
);

Low: Magic Number in Metrics

// Line 270: * 0.001

Recommendation: Use a named constant for unit conversion clarity.
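
For instance (the constant name is illustrative):

/// Milliseconds to seconds, for histogram recording.
const MS_TO_SECS: f64 = 1e-3;

metrics::TUNNEL_PING_DURATION.record(rtt as f64 * MS_TO_SECS, &[]);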


🧪 Test Coverage Concerns

Major Gap: No tests found for the ping/pong functionality. The only WebSocket test (simple_websocket_test2.rs) doesn't cover:

  • Ping timeout scenarios
  • RTT calculation correctness
  • Pong handling with missing requests
  • Concurrent ping/pong under load

Recommendation: Add integration tests covering:

  1. Normal ping/pong cycle
  2. Timeout when pongs stop arriving
  3. Late pong arrival handling
  4. Metrics collection verification

📊 Performance Considerations

Positive:

  • 3-second ping interval is reasonable
  • Using saturating_sub prevents overflow
  • Atomic operations for RTT storage in runner

Concern:

  • Every 3 seconds, all active tunnels send pings. Under high connection count, this could create periodic load spikes. Consider:
    • Jittering ping intervals slightly (see the sketch below)
    • Batching metrics recording
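
A sketch of the jitter idea, assuming the rand crate is available; ±10% keeps the average interval at 3 seconds:

use rand::Rng;
use tokio::time::{sleep, Duration};

// spread pings out so tunnels established at the same moment
// do not all ping in lockstep every 3 seconds
async fn jittered_ping_sleep() {
    let jitter_ms: i64 = rand::thread_rng().gen_range(-300..=300);
    sleep(Duration::from_millis((3_000 + jitter_ms) as u64)).await;
}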

🔒 Security Considerations

Good:

  • Timeout prevents resource exhaustion from stale connections
  • No user-controlled data in ping/pong messages

Note: 30-second timeout is quite generous. Consider if 15-20s would be sufficient given the 3s ping interval.


🎨 Code Quality

Strengths:

  • Follows repository conventions (hard tabs, error handling patterns)
  • Good use of workspace dependencies
  • Proper import organization

Issues:

  1. Import ordering: In guard-core/src/custom_serve.rs, the import of pegboard::tunnel::id::RequestId was moved but could be better grouped:

    // Before (line 11-12)
    use pegboard::tunnel::id::RequestId;
    
    // After (line 7)
    use pegboard::tunnel::id::RequestId;

    The change looks fine; just ensure it follows local conventions.

  2. Dead code annotations: In shared_state.rs:444-445, you added #[allow(dead_code)] for first_msg_index and last_msg_index. These are used in debug formatting, so the annotation is correct, but consider using them in the warning log for better debugging.


📝 Documentation Gaps

  1. No code comments explaining the ping/pong protocol flow
  2. Missing documentation on why 30s timeout was chosen
  3. The UPDATE_PING_INTERVAL constant has no doc comment
  4. No update to any README or architecture docs about this new subsystem

Recommendation: Add module-level documentation:

//! Ping/Pong health monitoring for tunnel connections.
//!
//! This module implements a health check mechanism that:
//! - Sends pings every 3 seconds from gateway to runner
//! - Expects pongs within 30 seconds
//! - Tracks RTT for observability
//! - Terminates stale connections

🔄 TypeScript Changes

The TypeScript stringify utilities look good:

  • Comprehensive logging support
  • Handles all protocol message types
  • No security issues with the logging

Minor: The stringify.ts file is quite large (336 lines). Consider splitting into multiple files by protocol section in a future refactor.


✨ Recommendations Summary

Must Fix (before merge):

  1. ✅ Verify and fix the last_pong mutation pattern with scc::HashMap
  2. ✅ Add basic integration test for ping/pong timeout

Should Fix:

  1. Improve timeout check logic
  2. Clarify metrics units in description
  3. Add structured logging fields to timeout warning

Nice to Have:

  1. Add module-level documentation
  2. Consider jittering ping intervals under high load
  3. Use structured fields in more log messages
  4. Review 30s timeout duration

🎯 Overall Assessment

This is a well-structured implementation of an important reliability feature. The code follows repository patterns and demonstrates good engineering practices. However, the potential race condition in pong handling is a blocker that needs verification, and the lack of tests is a significant gap.

Recommendation: Request changes for the critical issue and test coverage before merging.

Great work on the clean task separation and observability! 🚀

@claude

claude bot commented Nov 21, 2025

Pull Request Review

Summary

This PR implements a ping-pong mechanism for WebSocket tunnels between the gateway and pegboard runners. The implementation replaces the previous keep-alive system with an active health-check mechanism that measures round-trip time (RTT) and detects connection timeouts.


Code Quality & Best Practices

Strong Points:

  • Good separation of concerns: The refactoring splits the monolithic WebSocket handling logic into focused task modules (tunnel_to_ws_task, ws_to_tunnel_task, ping_task)
  • Proper structured logging: Uses tracing structured logging correctly
  • Metrics instrumentation: Added proper observability with TUNNEL_PING_DURATION histogram
  • Follows repository conventions: Uses workspace dependencies, hard tabs, lowercase log messages as per CLAUDE.md

Recommendations:

  1. Protocol versioning: The protocol changes (adding ToRunnerPing/ToGatewayPong) appear to maintain backward compatibility by using union types, which is good. However, consider documenting the minimum protocol version required for ping-pong support.

  2. Import organization (engine/packages/guard-core/src/custom_serve.rs:4-10): The import reordering is minimal but appears unnecessary. Consider keeping imports stable unless fixing issues.


Potential Issues

Critical:

  1. Ping timeout behavior (engine/packages/pegboard-gateway/src/shared_state.rs:220):

    • The timeout check uses saturating_sub, which is good for avoiding underflow
    • However, if now < req.last_pong due to clock skew, the timeout will never trigger
    • Recommendation: Consider using monotonic time (e.g., Instant) instead of wall-clock timestamps for timeout calculations; a sketch follows this list
  2. Race condition in abort logic (engine/packages/pegboard-gateway/src/lib.rs:434-467):

    • Multiple abort signals sent from different tasks could lead to ordering issues
    • The let _ pattern silently ignores send failures
    • Recommendation: Log if abort signals fail to send, as this could indicate the channel was already closed

Moderate:

  1. Error handling in GC (engine/packages/pegboard-gateway/src/shared_state.rs:486-487):

    • New first_msg_index and last_msg_index fields are marked with allow(dead_code)
    • These appear to be for debugging but aren't logged
    • Recommendation: Either log these values in the GC warning or remove if not needed
  2. Ping interval vs timeout relationship:

    • UPDATE_PING_INTERVAL = 3 seconds
    • TUNNEL_PING_TIMEOUT = 30 seconds
    • This means up to 10 missed pings before timeout, which seems reasonable
    • Recommendation: Document this relationship with a comment explaining the tolerance
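
For example, doc comments on the constants (as they appear in the PR) would capture the tolerance:

/// How often the gateway pings the runner over the tunnel.
const UPDATE_PING_INTERVAL: Duration = Duration::from_secs(3);

/// How long the gateway tolerates silence before declaring the tunnel
/// dead; at one ping every 3s this allows ~10 consecutive unanswered
/// pings before the connection is torn down.
const TUNNEL_PING_TIMEOUT: i64 = util::duration::seconds(30);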

Security Concerns

Low Risk:

  1. No message size limits on ping/pong: The ping/pong messages only contain small fixed-size data (gateway_id, request_id, timestamp), so no risk of DoS through oversized messages.

  2. Timestamp injection: The ts field is controlled by the sender and echoed back. While used for RTT calculation, it doesn't affect security-critical decisions.


Performance Considerations

Positive:

  • Ping interval of 3 seconds strikes a good balance between responsiveness and overhead
  • Using lazy_static for metrics is appropriate and efficient

Concerns:

  1. Metric cardinality: TUNNEL_PING_DURATION.record() with no attributes is good (avoids cardinality explosion)

  2. Lock contention: The SharedState uses scc::HashMap which is lock-free, good choice for concurrent access patterns


Test Coverage

Missing:

  • No test files included in this PR
  • Critical missing tests:
    1. Ping timeout behavior - verify connection closes after 30 seconds without pong
    2. RTT calculation accuracy - verify timestamps are handled correctly
    3. Task abort coordination - ensure all 3 tasks properly clean up
    4. Clock skew scenarios - test behavior with backwards time

Recommendation: Add integration tests in a follow-up PR or before merge


Additional Recommendations

  1. TypeScript SDK changes (engine/sdks/typescript/runner/src/stringify.ts): A new 151-line file was added. Consider reviewing if all stringify logic is necessary or if some can use existing serialization.

  2. Documentation: Consider adding inline documentation for the ping-pong mechanism flow explaining the 4-step process.

  3. Metrics dashboards: Ensure the new rivet_gateway_tunnel_ping_duration metric is added to relevant Grafana dashboards for monitoring RTT trends.


Summary

Overall Assessment: Approve with recommendations

This is a solid implementation of the ping-pong mechanism. The refactoring improves code organization and the health-check mechanism adds important reliability features. The main concerns are:

  1. Clock skew handling for timeout calculations
  2. Missing test coverage
  3. Minor error handling improvements

The code follows repository conventions well and the architecture is sound. Recommend addressing the clock skew issue before merge, and adding tests in a follow-up PR.

@NathanFlurry NathanFlurry force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from 40a9f30 to d287067 Compare November 21, 2025 01:27
@NathanFlurry NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from 9d2a169 to e98d4d0 Compare November 21, 2025 01:27
@claude

claude bot commented Nov 21, 2025

Pull Request Review: Implement Ping-Pong for Tunnel Protocol

Summary

This PR implements a ping-pong mechanism for the WebSocket tunnel protocol between the pegboard gateway and runner, replacing the previous keep-alive system. The implementation adds proper timeout detection and RTT (round-trip time) metrics collection.

✅ Strengths

1. Clean Architecture & Code Organization

  • Excellent refactoring: monolithic WebSocket handler logic split into focused task modules (tunnel_to_ws_task.rs, ws_to_tunnel_task.rs, ping_task.rs)
  • Separation of concerns makes the code much more maintainable and testable
  • Consistent patterns between gateway and runner implementations

2. Robust Ping-Pong Implementation

  • Gateway Side (pegboard-gateway/src/ping_task.rs):

    • Sends pings every 3 seconds with timestamp
    • Checks for timeout (30 seconds) before each ping
    • Returns appropriate errors on timeout
  • Runner Side (pegboard-runner/src/tunnel_to_ws_task.rs):

    • Immediately responds with pong containing original timestamp
    • Pong published to UPS without blocking WebSocket traffic
    • Correctly uses continue to avoid forwarding ping to client
  • State Tracking (shared_state.rs):

    • last_pong timestamp tracked per in-flight request
    • Pong handler updates timestamp and records RTT metrics
    • Clean integration with existing request lifecycle

3. Metrics & Observability

  • New metric TUNNEL_PING_DURATION captures RTT in seconds
  • Proper use of OpenTelemetry histogram with standard buckets
  • Structured logging with tracing::instrument macros throughout
  • Good log levels (debug for normal flow, warn for timeouts)

4. Protocol Updates

  • Protocol v3 correctly defines new message types:
    • ToRunnerPing with gateway_id, request_id, ts
    • ToGatewayPong with request_id, ts
  • Deprecates old ToRunnerKeepAlive / ToGatewayKeepAlive (replaced in union)
  • TypeScript SDK includes stringify helpers for debugging

5. Task Coordination

  • Proper use of tokio::select! for concurrent task management
  • Three independent tasks (tunnel_to_ws, ws_to_tunnel, ping) with abort channels
  • Clean shutdown logic: non-aborted tasks abort others on completion
  • Result prioritization logic handles edge cases well (lines 456-464 in lib.rs)

🔍 Issues & Concerns

CRITICAL: Potential Race Condition in Timeout Check

Location: pegboard-gateway/src/shared_state.rs:211-224

pub async fn send_and_check_ping(&self, request_id: RequestId) -> Result<()> {
    let req = self.in_flight_requests.get_async(&request_id).await
        .context("request not in flight")?;
    
    let now = util::timestamp::now();
    
    // Verify ping timeout
    if now.saturating_sub(req.last_pong) > TUNNEL_PING_TIMEOUT {
        tracing::warn!("tunnel timeout");
        return Err(WebSocketServiceTimeout.build());
    }
    // ... send ping
}

Problem: The last_pong check happens before sending the ping, but the 30-second timeout may not account for the time it takes to actually send the message. If there's network delay or UPS publishing delay, you could timeout right after sending a ping, even if a pong is about to arrive.

Recommendation: Consider checking timeout after successful ping send, or use a slightly longer timeout window (e.g., TUNNEL_PING_TIMEOUT + UPDATE_PING_INTERVAL).

MODERATE: Timing Constants Analysis

Constants:

  • UPDATE_PING_INTERVAL = 3 seconds (how often pings are sent)
  • TUNNEL_PING_TIMEOUT = 30 seconds (max time without pong)

Analysis:

  • 30 seconds / 3 seconds = 10 missed pings before timeout
  • This is reasonable for network resilience
  • However, the ratio could be made explicit via a constant like MAX_MISSED_PINGS = 10

Minor Issue: Gateway and runner both define UPDATE_PING_INTERVAL separately. Consider moving to a shared constant or protocol config.
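
One way to make the ratio explicit is to derive the timeout from the interval, so the relationship cannot silently drift; a sketch:

use std::time::Duration;

const UPDATE_PING_INTERVAL: Duration = Duration::from_secs(3);

/// Consecutive unanswered pings tolerated before timing out.
const MAX_MISSED_PINGS: u64 = 10;

/// 3s * 10 = 30s, matching the current hardcoded value.
const TUNNEL_PING_TIMEOUT: Duration =
    Duration::from_secs(UPDATE_PING_INTERVAL.as_secs() * MAX_MISSED_PINGS);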

MINOR: Import Organization

Several files have imports that could be better organized per CLAUDE.md guidelines:

Example: pegboard-gateway/src/lib.rs:11-20

use rivet_guard_core::{
    WebSocketHandle,
    custom_serve::{CustomServeTrait, HibernationResult},
    errors::{ServiceUnavailable, WebSocketServiceUnavailable},
    proxy_service::{ResponseBody, is_ws_hibernate},
    request_context::RequestContext,
    websocket_handle::WebSocketReceiver,
};

The imports are grouped correctly, but WebSocketHandle is at the top while websocket_handle::WebSocketReceiver is at the bottom. For consistency, related types should be adjacent.

MINOR: Inconsistent Error Handling

Location: pegboard-runner/src/ping_task.rs:27-35

let Some(wf) = ctx.workflow::<pegboard::workflows::runner::Input>(conn.workflow_id)
    .get().await?
else {
    tracing::error!(?conn.runner_id, "workflow does not exist");
    return Ok(()); // <-- Returns Ok despite error
};

Returning Ok(()) when the workflow doesn't exist silently swallows the error. While this may be intentional (to keep the ping task alive), it would be clearer to:

  1. Add a comment explaining why this returns Ok
  2. Or use a custom error type that indicates "skip this ping"

MINOR: Missing Tracing Fields

The ping task at pegboard-gateway/src/ping_task.rs:8-23 lacks a #[tracing::instrument] attribute. While the called function send_and_check_ping has instrumentation, adding it to the task would provide better span context.

🧪 Testing Concerns

No Tests Found:

  • No unit tests for ping-pong logic
  • No integration tests for timeout scenarios
  • No tests for race conditions (ping sent right before pong received)

Recommendations:

  1. Add unit tests for:
    • Timeout detection with mocked timestamps
    • RTT calculation accuracy
    • Proper abort signal handling
  2. Add integration tests for:
    • Normal ping-pong flow
    • Timeout after N missed pings
    • Recovery after temporary network issues
    • Concurrent ping while processing other messages

🔒 Security Considerations

No Issues Identified - The implementation properly validates message IDs and uses existing authentication/authorization mechanisms. The ping-pong mechanism doesn't introduce new attack vectors.

⚡ Performance Considerations

Positive:

  • Ping interval of 3 seconds is reasonable (not too chatty)
  • Pongs published via UPS without blocking main message flow
  • Metrics collection is efficient (histogram recording is O(1))

Potential Concern:

  • Gateway side pings all active connections every 3 seconds
  • For high connection counts (e.g., 10K+ concurrent WebSockets), this could generate significant UPS traffic
  • Recommendation: Consider adding metrics to track ping task overhead in production

📝 Code Style & Convention Adherence

Good:

  • ✅ Follows hard tabs per rustfmt.toml
  • ✅ Proper use of structured logging (no string interpolation in log messages)
  • ✅ Lowercase log messages (e.g., "tunnel timeout")
  • ✅ Uses workspace dependencies correctly
  • ✅ Follows existing error handling patterns with RivetError derive macro

Minor Issues:

  • Import organization could be slightly improved (noted above)
  • Missing some #[tracing::instrument] attributes for consistency

🎯 Recommendations

High Priority

  1. Review timeout logic to prevent race condition between ping send and timeout check
  2. Add tests for ping-pong functionality, especially timeout scenarios

Medium Priority

  1. Consider making UPDATE_PING_INTERVAL a protocol-level constant shared between gateway and runner
  2. Add explicit constant MAX_MISSED_PINGS to make timeout relationship clear
  3. Document error handling strategy in ping_task.rs when workflow doesn't exist

Low Priority

  1. Add #[tracing::instrument] to ping task functions for complete observability
  2. Organize imports more consistently
  3. Consider monitoring ping overhead metrics in production

📊 Overall Assessment

Verdict: ✅ Approve with Minor Concerns

This is a solid implementation that significantly improves tunnel reliability. The code quality is high, the architecture is clean, and the feature works as intended. The critical issue with timeout timing is worth addressing, but it's unlikely to cause problems in practice given the generous timeout window. The lack of tests is concerning for such critical infrastructure code, but this can be addressed in a follow-up.

Great work overall! The refactoring to separate task modules is particularly well done and will make future maintenance much easier.


Files Reviewed: 27 files, 674 additions, 290 deletions
Key Files: pegboard-gateway/src/lib.rs, shared_state.rs, ping_task.rs, tunnel_to_ws_task.rs, protocol schemas

@claude

claude bot commented Nov 21, 2025

Pull Request Review: Tunnel Ping/Pong Implementation

Thank you for implementing this important health monitoring feature! Overall, this is a solid implementation that adds connection health checks between the gateway and pegboard runner.


✅ Strengths

  1. Good Code Organization: Breaking the monolithic websocket handler into separate task modules improves readability and maintainability.

  2. Protocol Extension: The new ToRunnerPing and ToGatewayPong messages are well-designed with timestamps for RTT calculation.

  3. Metrics Integration: Adding the TUNNEL_PING_DURATION histogram for observability is excellent.

  4. Consistent Error Handling: The ping timeout check properly returns WebSocketServiceTimeout error.

  5. Task Coordination: The abort mechanism using watch channels to coordinate the three concurrent tasks is well-implemented.


🔍 Issues & Concerns

Critical

1. Race Condition in Ping Timeout Check (shared_state.rs:210-224)

The req guard is dropped after reading last_pong but before sending the ping. If a pong arrives between the timeout check and the ping send, the check could incorrectly timeout.

Recommendation: Hold the lock through both operations or use atomic operations.

2. First Ping Timing Issue (shared_state.rs:119)

last_pong is initialized to the current time, but the first ping won't be sent until 3 seconds later. If the connection takes longer than 30 seconds to establish, the first ping could incorrectly trigger a timeout.

Recommendation: Send an immediate ping after connection establishment or adjust the initial last_pong.
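
Note that restructuring the gateway loop around tokio::time::interval (as suggested elsewhere in this thread) also addresses this, because an interval's first tick completes immediately; a sketch:

use tokio::time::{interval, Duration, MissedTickBehavior};

async fn ping_loop() {
    let mut ticker = interval(Duration::from_secs(3));
    ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);
    loop {
        // first tick resolves at t=0, then t=3s, t=6s, ...
        ticker.tick().await;
        // send_and_check_ping(...) would go here
    }
}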

High Priority

3. Silent Ping Failures (ping_task.rs:21)

The ping task propagates errors without logging them before termination.

Recommendation: Add error logging before propagating.

4. Dropped Pongs Not Tracked (shared_state.rs:255-264)

When a pong arrives for a non-existent request, it's silently dropped. This could indicate bugs or attacks.

Recommendation: Add a counter metric for dropped pongs.
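
A sketch of such a counter, assuming the metrics module follows the same pattern as TUNNEL_PING_DURATION; the name and builder calls are illustrative and depend on the opentelemetry version in use:

use opentelemetry::{global, metrics::Counter};

lazy_static::lazy_static! {
    // hypothetical metric, named to match rivet_gateway_tunnel_ping_duration
    pub static ref TUNNEL_DROPPED_PONG: Counter<u64> = global::meter("pegboard-gateway")
        .u64_counter("rivet_gateway_tunnel_dropped_pong")
        .with_description("Pongs received for requests no longer in flight.")
        .build();
}

The drop branch of the pong handler would then record TUNNEL_DROPPED_PONG.add(1, &[]); before continuing.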

Medium Priority

5. Hardcoded Ping Interval: Consider making UPDATE_PING_INTERVAL configurable.

6. Metric Description: Update to clarify this is RTT, not one-way latency.

7. Missing Request ID in Logs: Ping task doesn't include request_id in tracing span.


💡 Code Quality

✅ Uses tracing with structured logging
✅ Proper tokio::select! usage
✅ Consistent error handling
✅ Good separation of concerns
⚠️ Minor: Cargo.toml dependency ordering


🔒 Security

No major security issues identified. The ping/pong mechanism is internal and doesn't expose new attack surfaces.


⚡ Performance

  • 3-second ping interval is reasonable
  • Watch channels are efficient
  • 30-second timeout seems reasonable for detecting stale connections
  • Minor lock contention possible under high load but acceptable

📝 Test Coverage

Missing: No unit tests found for the new ping/pong functionality.

Recommendation: Add integration tests for normal flow, timeout detection, race conditions, and metrics.


🎯 Summary

This is a solid implementation with good code quality. Main concerns:

  1. Critical: Race condition in ping timeout check (fix before merge)
  2. Critical: First ping timing issue (fix before merge)
  3. High: Missing error logging (should fix)
  4. Missing: Test coverage (should add)

Estimated risk level: Medium (with fixes, Low)

@graphite-app
Contributor

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

graphite-app bot pushed a commit that referenced this pull request Nov 21, 2025
@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-19-fix_tunnel_implement_ping_pong branch November 21, 2025 02:43