Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

@vercel
Copy link

vercel bot commented Nov 25, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 26, 2025 2:20am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 26, 2025 2:20am
rivet-inspector Ignored Ignored Preview Nov 26, 2025 2:20am
rivet-site Ignored Ignored Preview Nov 26, 2025 2:20am

Copy link
Contributor Author

MasterPtato commented Nov 25, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 25, 2025

PR Review: Implement events and commands for actor workflows

I've reviewed this PR which implements MK2 protocol support for actor workflows. Here are my findings:

Strengths

  1. Well-structured protocol versioning with clear upgrade/downgrade paths
  2. Robust event demuxing with batching and proper cleanup
  3. Careful initialization sequence that handles missed commands
  4. Good per-runner state management with separate event/command indices

Issues & Concerns

1. Channel abort TODO (actor_event_demuxer.rs:81) - HIGH PRIORITY
The TODO about aborting tasks needs addressing. Aborting may lose in-flight events. Consider graceful shutdown or document why abort is safe.

2. Metadata parsing (conn.rs:285-288, 299-303) - MEDIUM PRIORITY
Silent failures via unwrap_or_default() may hide malformed metadata. Add logging for parse failures.

3. Event deduplication (workflows/actor/mod.rs:466-477)
Consider edge cases: What if events arrive out of order? Risk of loss if runner crashes mid-send?

4. Runner state cleanup (workflows/actor/mod.rs:623-625)
Ensure no race condition where events from removed runner recreate state entry.

5. Missing command ack tracking
Event acks are batched but no similar tracking for command acks. Intentional?

Security

  • Actor ownership validation: Good ✓
  • Generation checking: Proper filtering ✓

Performance

  • Good event batching with 1024 buffer
  • Proper transaction isolation levels
  • Consider: Loop processes up to 1024 signals - may need periodic yielding

Test Coverage

Consider adding:

  • Event deduplication tests
  • Missed command recovery tests
  • Multi-runner generation change tests

Recommendations

  1. HIGH: Address abort TODO or document safety
  2. MEDIUM: Add metadata parse failure logging
  3. MEDIUM: Add reconnection integration tests

Overall

Solid implementation with good architecture. Main concern is the task abort TODO.

Recommendation: Approve with minor changes

@MasterPtato MasterPtato force-pushed the 11-24-fix_pb_implement_events_and_commands_for_actor_wfs branch from 88e1336 to 52845b6 Compare November 26, 2025 02:16
@MasterPtato MasterPtato force-pushed the 11-21-fix_make_all_uses_of_protocol_handle_v4_-_v3 branch from 7ddbf54 to 72a5b39 Compare November 26, 2025 02:16
@MasterPtato MasterPtato force-pushed the 11-24-fix_pb_implement_events_and_commands_for_actor_wfs branch from 52845b6 to 0d4625a Compare November 26, 2025 02:18
@pkg-pr-new
Copy link

pkg-pr-new bot commented Nov 26, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3543

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3543

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3543

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3543

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3543

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3543

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3543

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3543

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3543

commit: 0d4625a

@claude
Copy link

claude bot commented Nov 26, 2025

Pull Request Review: fix(pb): implement events and commands for actor wfs

Overview

This PR implements events and commands handling for actor workflows in the Pegboard system, specifically adding support for the mk2 protocol version. The changes span across multiple packages including pegboard-runner, pegboard, pegboard-gateway, and related infrastructure.

Code Quality & Architecture

Strengths:

  1. Good abstraction for protocol versions via the Init enum in conn.rs (lines 203-255)
  2. Event batching properly reduces overhead using EVENT_ACK_BATCH_SIZE = 250
  3. Proper state management with RunnerState tracking for multiple runners
  4. Database transaction isolation with custom instrumentation

Potential Bugs & Issues

  1. Missing error handling in metadata parsing (conn.rs:285-288, 299-303): Using unwrap_or_default() silently swallows parsing errors. Consider logging when metadata parsing fails.

  2. Unchecked index arithmetic (actor/mod.rs:468-469): The filtering assumes monotonically increasing indices. If the runner sends duplicate or out-of-order events, this could cause issues.

  3. Potential race condition (conn.rs:88-117): Between checking if a runner exists/is expired and dispatching the workflow, another connection could modify the runner state. Consider documenting this.

  4. Missing protocol version validation (pegboard-gateway/lib.rs:164-174): The code defaults to PROTOCOL_MK1_VERSION when no version is stored, but this could cause issues if a runner reconnects with a different version.

@claude
Copy link

claude bot commented Nov 26, 2025

Performance Considerations

  1. Database read amplification (conn.rs:324-356): Reading all missed commands during init could be expensive for runners offline for long periods. Consider pagination, command expiration, and metrics.

  2. Iterator chaining efficiency (keys/runner.rs:804-806): Nested iterators could be optimized using flat_map.

  3. Event filtering on every signal (actor/mod.rs:464-473): For high-frequency events, consider maintaining a cursor or checkpointing more aggressively.

Security Concerns

  1. No metadata size validation (conn.rs:297-316): A malicious runner could send excessive metadata. Add size validation before chunking.

  2. Command replay vulnerability (conn.rs:323-356): Missed commands are replayed without verifying relevance. Consider adding command expiration.

Test Coverage

Missing test scenarios:

  • No tests for handle_init with prepopulated actor names
  • Missing event deduplication logic tests
  • No metadata chunking/splitting tests
  • Missing protocol version negotiation edge case tests

Database Schema Concerns

Command persistence: Commands are persisted but there is no cleanup logic. This could lead to unbounded growth over time. Add a cleanup mechanism for acknowledged/expired commands.

@claude
Copy link

claude bot commented Nov 26, 2025

Style & Convention Issues

  1. Import ordering (conn.rs:7-8): Should combine duplicate futures_util imports per CLAUDE.md conventions
  2. Formatting inconsistencies: Auto-formatting changes should be in a separate commit

Recommendations

High Priority:

  1. Add size validation for metadata to prevent abuse
  2. Implement command cleanup/expiration logic
  3. Add logging for metadata parsing errors
  4. Add tests for the new event handling logic

Medium Priority:

  1. Document the protocol version fallback behavior
  2. Add metrics for missed command counts and event batch sizes
  3. Optimize iterator chaining in metadata key combining
  4. Consider pagination for missed commands

Low Priority:

  1. Separate formatting changes into their own commit
  2. Combine duplicate imports
  3. Document the EVENT_ACK_BATCH_SIZE constant

Overall Assessment

This is a solid implementation that handles complex protocol versioning and event management. The code is generally well-structured with proper error handling. However, there are areas that need attention around data validation, performance at scale, and test coverage. The main concerns are unbounded growth (metadata size, command storage) and potential edge cases in event deduplication.

Recommendation: Request changes to address high-priority issues, particularly around data validation and resource management.

@claude
Copy link

claude bot commented Nov 26, 2025

PR Review: Implement Events and Commands for Actor Workflows

This PR implements a significant refactoring of the actor lifecycle management system to use persistent events and commands for mk2 protocol runners. Here's my comprehensive review:

✅ Strengths

  1. Protocol Versioning: Good backward compatibility with mk1 protocol while introducing mk2 improvements

    • Clean separation between mk1 and mk2 handling in conn.rs:181-198
    • Proper fallback to PROTOCOL_MK1_VERSION when version is missing
  2. Event Batching: Efficient event acknowledgment with batching (lines actor/mod.rs:586-606)

    • Uses EVENT_ACK_BATCH_SIZE = 250 to reduce message overhead
    • Only sends acks when threshold is exceeded
  3. State Management: Well-structured state tracking per runner

    • RunnerState tracks both last_event_idx and last_event_ack_idx
    • LifecycleRunnerState maintains proper event sequencing
  4. Actor Event Demuxer: Clean separation of concerns

    • Batches events efficiently (actor_event_demuxer.rs:52)
    • Includes proper garbage collection (actor_event_demuxer.rs:76-89)

⚠️ Issues & Concerns

Critical

  1. Missing Error Handling in Connection Init (conn.rs:70)
    If init parsing fails, the connection is dropped without sending a proper error response to the runner. Consider sending an error message before closing.

  2. Silent Event Send Failures (actor_event_demuxer.rs:41, 63)
    Events are silently dropped if the channel is full or closed with underscore pattern. This could lead to event loss without any warning. Consider:

    • Logging when sends fail
    • Implementing backpressure or flow control
    • Alerting when events are dropped
  3. Unsafe Abort in GC (actor_event_demuxer.rs:84)
    There's a TODO comment indicating uncertainty about task abortion safety. Aborting tasks can leave state inconsistent. Consider:

    • Sending a shutdown signal to the task instead
    • Waiting for graceful completion with a timeout
    • Documenting why abort is safe if it is
  4. Race Condition in Protocol Version (pegboard-gateway/lib.rs:165-176)
    The protocol version is read from UDB but uses read_opt with a fallback. If a runner reconnects between mk1 and mk2, there's a small window where the version might be stale or missing.

Medium Priority

  1. Transaction Isolation Inconsistency (runtime.rs:244)
    Mixing Snapshot and Serializable reads in the same transaction could lead to subtle consistency issues. The comment explains the intent but the safety should be verified.

  2. Unbounded Command Index (runtime.rs:1105)
    The command index grows unboundedly (last_command_idx += input.commands.len()). Consider:

    • What happens when it overflows i64::MAX?
    • Implementing command cleanup/archival
  3. Metadata Deserialization (conn.rs:285-288, 300-303)
    Invalid JSON in metadata silently becomes empty with unwrap_or_default(). Consider logging when deserialization fails for debugging.

  4. Missing Context in Trace (actor_event_demuxer.rs:38)
    Good logging added, but the ingest method isn't instrumented. Consider adding tracing::instrument for better traceability.

Low Priority

  1. Code Formatting: The PR includes unrelated formatting changes in api-builder/middleware.rs:189. This is minor but makes the diff harder to review.

  2. Commented-Out Code: Comment formatting change in metrics/buckets.rs:5 seems unnecessary.

📊 Performance Considerations

  1. Event Batching: Good use of batching in actor_event_demuxer.rs:52 with buffer size 1024
  2. Database Queries: The StreamingMode::WantAll in conn.rs:326 could be expensive if there are many missed commands. Consider pagination.
  3. Command Storage: Commands are stored indefinitely in UDB. Consider implementing cleanup based on acknowledgment.

🔒 Security Considerations

  1. Metadata Injection: The prepopulate_actor_names and metadata fields allow runners to write arbitrary data to UDB. Ensure proper validation exists upstream.
  2. Command Replay: Missed commands are replayed on reconnect (conn.rs:377-385). Verify that commands are idempotent to prevent double-execution issues.

🧪 Testing Recommendations

  1. Add tests for event deduplication logic (actor/mod.rs:466-470)
  2. Test command index overflow scenarios
  3. Test event loss scenarios (what happens if demuxer drops events?)
  4. Test mk1 ↔ mk2 protocol migration
  5. Test runner reconnection with missed commands

📝 Documentation Gaps

  1. The new ActorCommandKey structure needs documentation explaining the indexing scheme
  2. The relationship between last_event_idx, last_event_ack_idx, and last_command_idx should be documented
  3. Add comments explaining why certain transactions use Snapshot vs Serializable

✨ Suggestions

  1. Extract Constants: Magic numbers like 1024 for batch sizes should be constants
  2. Metrics: Add metrics for:
    • Events dropped by demuxer
    • Command replay counts
    • Event processing latency
  3. Graceful Degradation: Consider what happens when UDB is slow/unavailable during init

Overall Assessment

This is a substantial and well-architected change that modernizes the actor lifecycle system. The core logic is sound, but there are several areas where error handling and edge cases need attention. The critical issues around silent event dropping and task abortion should be addressed before merging.

Recommendation: Request changes to address critical issues, then approve after fixes.

Files needing attention:

  • engine/packages/pegboard-runner/src/actor_event_demuxer.rs
  • engine/packages/pegboard-runner/src/conn.rs
  • engine/packages/pegboard/src/workflows/actor/runtime.rs

@claude
Copy link

claude bot commented Nov 26, 2025

PR Review: fix(pb): implement events and commands for actor wfs

I've completed a comprehensive review of this PR. This is a significant refactor implementing mk2 protocol support for actor workflows. Overall, the implementation is solid, but I've identified several areas that need attention.

Critical Issues

1. Missing Error Propagation (pegboard-gateway/src/lib.rs:210-213)
The result from send().await? is now being checked, but there's no error handling if the workflow isn't found. Add proper error handling or document why graceful_not_found() is acceptable.

2. Unsafe unwrap_or_default() in Protocol Version (pegboard-gateway/src/lib.rs:172-173)
Silently defaults to MK1 protocol if version isn't found. Log when defaulting to MK1 or consider failing for critical operations.

3. Race Condition in Init Handler (pegboard-runner/src/conn.rs:257-272)
Signal is sent before database writes complete. Workflow may not see prepopulated actor names or metadata. Complete DB writes before sending signal or add synchronization.

4. Silent JSON Parse Failures (pegboard-runner/src/conn.rs:285-288, 299-303)
JSON parsing errors are silently ignored with empty defaults. Add logging for parse failures.

High Priority Issues

5. TODO Comments Need Resolution
Multiple TODOs in pegboard-runner/src/ws_to_tunnel_task.rs - KV operations may be blocking the WebSocket receive loop. Create follow-up issues or resolve in this PR.

6. Missing Tracing (pegboard/src/workflows/actor/runtime.rs:1096-1163)
insert_and_send_commands activity needs structured logging for debugging.

7. HACK Comment (pegboard/src/workflows/actor/runtime.rs:573-574)
Using util::timestamp::now() for create_ts breaks workflow determinism. Use ctx.create_ts() instead.

Medium Priority Issues

8. Unbounded Channels (pegboard-runner/src/actor_event_demuxer.rs:29)
Could cause memory issues. Consider bounded channels with backpressure.

9. Hardcoded EVENT_ACK_BATCH_SIZE
Should this be configurable per deployment?

10. Incomplete Runner State Cleanup
State could accumulate if check_runners fails. Add periodic cleanup or document retention policy.

Positive Observations

  • Excellent protocol abstraction with Init enum
  • Robust error context throughout
  • Proper database subspace management
  • Good structured logging (with room for improvement)
  • Careful transaction design with isolation levels

Test Coverage

Missing tests for:

  • Event demuxer
  • Command insertion and acknowledgment
  • Protocol version fallback

Summary

Overall: Approve with Changes

Priority actions before merge:

  1. Fix init handler race condition
  2. Add error logging for JSON parse failures
  3. Resolve TODO comments or create follow-up issues
  4. Fix HACK with dynamic timestamps

Follow-up issues to create:

  • KV operation performance optimization
  • Test coverage for event/command system
  • Error message sanitization

Great work on this complex refactor!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants