
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivetkit-serverless | Error | Error | | Nov 25, 2025 2:58am |

3 Skipped Deployments

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivet-cloud | Ignored | Ignored | | Nov 25, 2025 2:58am |
| rivet-inspector | Ignored | Ignored | Preview | Nov 25, 2025 2:58am |
| rivet-site | Ignored | Ignored | Preview | Nov 25, 2025 2:58am |

Contributor Author

MasterPtato commented Nov 18, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new

pkg-pr-new bot commented Nov 18, 2025


@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3480

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3480

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3480

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3480

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3480

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3480

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3480

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3480

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3480

commit: 4eabd07

@claude

claude bot commented Nov 18, 2025

Code Review: Rewrite runner workflow to handle batch signals

Summary

This PR introduces runner2.rs as a rewrite of the runner workflow to handle batch signal processing. The key improvement is switching from processing signals one-at-a-time to processing them in batches using listen_n_with_timeout instead of listen_with_timeout.

Positive Aspects

Architecture & Design

  • Batch Processing: The switch to listen_n_with_timeout with a batch size of 1024 is excellent for performance and throughput
  • State Separation: Good separation between State (persistent workflow state) and LifecycleState (loop iteration state)
  • Signal Batching: Well-designed batching logic that accumulates signals before processing (lines 94-183)
  • Message Batching: Efficient batching of outbound messages before sending (line 185)

Error Handling

  • Graceful Degradation: Proper use of .graceful_not_found() when signaling actor workflows (lines 163, 247, 362)
  • Race Condition Handling: Good handling of draining runners receiving start actor commands (lines 139-172)

Code Quality

  • Clear Comments: Good inline documentation explaining non-obvious behavior (lines 236, 819, 967, 985)
  • Consistent Patterns: Activities follow consistent naming and structure
  • Logging: Appropriate use of structured logging with tracing::warn!

Issues & Concerns

Critical Issues

1. Unused Import (line 11)

use vbare::OwnedVersionedData;

This import is not used anywhere in the file and should be removed. The project uses workspace dependencies, so unnecessary imports should be cleaned up.

Fix: Remove line 11

2. Potential Division by Zero (lines 620, 1057)

let remaining_millislots = (remaining_slots * 1000) / input.total_slots;

If input.total_slots is 0, this will panic. While this may be prevented by validation elsewhere, defensive coding suggests adding a check or assertion.

Recommendation: Add validation or use checked_div with proper error handling
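
A minimal sketch of the guard, assuming the fields are i64 as in the snippet above and that the bail! macro used elsewhere in this file is in scope:

// Guard against a zero divisor instead of letting the division panic.
if input.total_slots == 0 {
    bail!("total_slots must be greater than 0");
}
let remaining_millislots = (remaining_slots * 1000) / input.total_slots;

// Alternatively, checked_div returns None on a zero divisor, making the
// fallback explicit rather than a panic:
let remaining_millislots = (remaining_slots * 1000)
    .checked_div(input.total_slots)
    .unwrap_or(0);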

3. Missing Signal Handler?

The Main signal enum at line 1170 includes CheckQueue, Command, Forward, and Stop, and the batch processing loop handles each of them. However, nothing validates that every signal variant is handled: if a new signal type is added to Main, the compiler won't force an update here, since the match is not exhaustive over the enum.

Recommendation: Consider if this is intentional or if the signal handling should be refactored for better compile-time safety

Performance Considerations

4. Sequential Signal Processing (lines 239-256)

// NOTE: This should not be parallelized because signals should be sent in order
// Forward to actor workflows
// Process events
for event in &events {
    // ... sends signal to actor workflow
}

While the comment explains this must be sequential, this could become a bottleneck with many events. Each signal send is an async operation that must complete before the next begins.

Consider:

  • Is strict ordering truly required for all events, or just events for the same actor?
  • Could you batch events by actor_id and parallelize across different actors while maintaining order per-actor? (see the sketch below)
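
A hedged sketch of that grouping, reusing the signal-builder chain quoted elsewhere in this thread; the context type, the error type, and the assumption that ctx can be shared immutably across concurrent futures are all illustrative:

use std::collections::BTreeMap;

// Group events by actor so order is preserved within an actor while
// different actors are dispatched concurrently.
let mut per_actor: BTreeMap<String, Vec<_>> = BTreeMap::new();
for event in &events {
    let actor_id = crate::utils::event_actor_id(&event.inner).to_string();
    per_actor.entry(actor_id).or_default().push(event);
}

futures::future::try_join_all(per_actor.into_iter().map(|(actor_id, actor_events)| async move {
    // Strictly ordered within a single actor
    for event in actor_events {
        ctx.signal(crate::workflows::actor::Event { inner: event.inner.clone() })
            .to_workflow::<crate::workflows::actor::Workflow>()
            .tag("actor_id", &actor_id)
            .graceful_not_found()
            .send()
            .await?;
    }
    Ok::<(), Error>(()) // substitute the crate's error type
}))
.await?;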

5. Sequential Allocation Signals (lines 315-321)

for alloc in res.allocations {
    ctx.signal(alloc.signal)
        .to_workflow::<crate::workflows::actor::Workflow>()
        .tag("actor_id", alloc.actor_id)
        .send()
        .await?;
}

Similar to #4, these allocations are sent sequentially but could potentially be parallelized since they're going to different actors.

Recommendation: Use futures::future::try_join_all or similar to parallelize these independent operations
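
If send() returns a future, the fan-out collapses to a one-shot join; this sketch assumes those futures can be driven concurrently and that their error types unify:

futures::future::try_join_all(res.allocations.into_iter().map(|alloc| {
    ctx.signal(alloc.signal)
        .to_workflow::<crate::workflows::actor::Workflow>()
        .tag("actor_id", alloc.actor_id)
        .send()
}))
.await?;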

6. Message-by-Message Publishing (lines 1140-1147)

for message in &input.messages {
    let message_serialized = versioned::ToClient::wrap_latest(message.clone())
        .serialize_with_embedded_version(PROTOCOL_VERSION)?;
    
    ctx.ups()?
        .publish(&receiver_subject, &message_serialized, PublishOpts::one())
        .await?;
}

Each message is published individually in a loop. If the pubsub system supports batch publishing, this could be optimized.

Recommendation: Check if ups() supports batch publishing and utilize it

Code Quality Issues

7. Commented Out Code (lines 36, 819-825)

// events: Vec<EventRow>,
// TODO: Storing events is disabled for now, otherwise state will grow indefinitely

While the TODO explains why events aren't stored, the commented code should either be removed or the decision should be finalized.

Recommendation: Either implement a proper solution (e.g., event truncation/archival) or remove the dead code

8. Deprecated Field (line 455-456)

#[derive(Debug, Serialize, Deserialize)]
struct InitOutput {
    /// Deprecated.
    evict_workflow_id: Option<Id>,
}

If this field is deprecated, there should be a plan to remove it. Is this for backwards compatibility? Should it have a timeline for removal?

Recommendation: Add context on when this can be removed or if it's needed for backward compatibility

9. Clone Overhead (line 70, 284)

let input = input.clone();  // line 70
commands: commands.clone(),  // line 284

The input is cloned on every loop iteration. While Input is relatively small, commands could grow large and is cloned even when it might not be needed (the clone happens before the activity, but the activity might not use all fields).

Recommendation: Consider if these clones are necessary or if references could be used

Testing & Documentation

10. No Tests Included

This is a significant rewrite (1176 lines) that changes core behavior from single-signal to batch-signal processing, but no tests are included in the PR.

Critical Recommendation: Add tests covering:

  • Batch signal processing with various batch sizes
  • Timeout behavior when receiving fewer signals than batch size
  • Race conditions (draining runner receiving start commands)
  • Event ordering guarantees
  • Command acknowledgment logic
  • State persistence and recovery

11. Missing Migration Path

The file is named runner2.rs, suggesting it runs alongside the old runner.rs.

Questions:

  • How will existing runner workflows migrate from runner to runner2?
  • Are both implementations expected to run concurrently?
  • Is there a feature flag or rollout plan?
  • Will runner.rs be deprecated?

Recommendation: Document the migration strategy

Minor Issues

12. Custom Constructor Instead of Default (lines 427-442)

LifecycleState::new() could use Default trait instead of a custom constructor since it just initializes with default values.

Recommendation:

impl Default for LifecycleState {
    fn default() -> Self { ... }
}

13. Magic Number (line 16)

const EVENT_ACK_BATCH_SIZE: i64 = 500;

Good use of a constant, but consider if this should be configurable via the config system like other thresholds.

Security Concerns

14. Unbounded Batch Size

While the batch size is limited to 1024 signals (line 77), each signal could contain variable-sized data (e.g., prepopulate_actor_names in init, metadata). This could potentially lead to memory issues if many large signals arrive.

Recommendation: Consider adding size-based limits in addition to count-based limits
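
A hypothetical sketch of a size-aware split applied after the batch is received; Main is the signal enum from this file, while estimated_size and MAX_BATCH_BYTES are illustrative:

const MAX_BATCH_BYTES: usize = 4 * 1024 * 1024;

// Split an already-received batch into size-bounded chunks so a single
// processing pass never holds more than MAX_BATCH_BYTES of payload.
let mut chunks: Vec<Vec<Main>> = vec![Vec::new()];
let mut bytes = 0usize;
for sig in signals {
    let sz = estimated_size(&sig); // assumed per-signal size accounting
    if bytes + sz > MAX_BATCH_BYTES && !chunks.last().unwrap().is_empty() {
        chunks.push(Vec::new());
        bytes = 0;
    }
    bytes += sz;
    chunks.last_mut().unwrap().push(sig);
}
// Each chunk then flows through the existing batch-processing logic.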

15. Error Message Information Disclosure (lines 130-133)

bail!(
    "received message that should not be sent to runner workflow: {:?}",
    sig.inner
)

Logging the full message content in an error could potentially expose sensitive information.

Recommendation: Log message type only, not full content, or ensure sensitive fields are redacted
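
A hedged sketch that logs only the variant name; the variant list is illustrative and would need to match the real protocol::ToServer enum:

fn message_kind(msg: &protocol::ToServer) -> &'static str {
    match msg {
        protocol::ToServer::ToServerInit(_) => "ToServerInit",
        protocol::ToServer::ToServerEvents(_) => "ToServerEvents",
        protocol::ToServer::ToServerStopping => "ToServerStopping",
        // Catch-all keeps the helper total if variants are added
        _ => "other",
    }
}

bail!(
    "received message that should not be sent to runner workflow: {}",
    message_kind(&sig.inner)
)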

Performance Metrics

The batch processing approach should provide significant performance improvements:

  • Reduced syscalls: Batching reduces context switches
  • Better throughput: Processing 1024 signals vs 1 signal per iteration
  • Reduced activity overhead: Batching messages reduces workflow activity invocations

Recommendation: Add metrics to track (a sketch follows this list):

  • Average batch sizes received
  • Processing latency per batch
  • Time spent waiting vs processing
  • Signal queue depths
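
A minimal sketch using structured tracing fields as a stand-in for a real metrics backend; the field names are illustrative:

let started = std::time::Instant::now();
let batch_size = signals.len();
// ... existing batch processing ...
tracing::debug!(
    batch_size,
    elapsed_ms = started.elapsed().as_millis() as u64,
    "processed signal batch"
);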

Summary & Recommendations

Must Fix Before Merge

  1. Remove unused import vbare::OwnedVersionedData (line 11)
  2. Add division-by-zero safety for total_slots calculations (lines 620, 1057)
  3. Add comprehensive tests for batch signal processing behavior
  4. Document migration strategy from runner.rs to runner2.rs

Should Fix Before Merge

  1. Parallelize independent signal sends (allocation signals, lines 315-321)
  2. Review event processing ordering requirements - consider per-actor parallelization
  3. Resolve the commented-out event storage code - implement or remove
  4. Add performance metrics for batch processing

Consider for Future

  1. Make EVENT_ACK_BATCH_SIZE configurable
  2. Implement batch publishing if supported by pubsub system
  3. Add size-based limits in addition to count-based batch limits
  4. Use Default trait for LifecycleState

Conclusion

This is a well-architected improvement to the runner workflow that should significantly improve performance through batch processing. The code quality is generally good with clear separation of concerns and proper error handling. However, the lack of tests for such a critical component is concerning, and the migration strategy needs clarification. With the critical issues addressed, this will be a solid improvement to the system.

Great work on the batch processing implementation! The signal accumulation logic is particularly well done.

@MasterPtato force-pushed the 11-17-fix_pb_rewrite_runner_wf_to_handle_batch_signals branch from b180eca to 4eabd07 on November 18, 2025 23:50
@MasterPtato force-pushed the 11-17-fix_gas_fix_batch_listen_fix_history_for_graceful_signal_send_in_workflows branch from 0ca2fa2 to 321b2c3 on November 18, 2025 23:50
@claude

claude bot commented Nov 18, 2025

Code Review

I've reviewed PR #3480 which rewrites the runner workflow to handle batch signals. Here's my detailed feedback:

Overview

This PR introduces a new runner2.rs workflow that processes signals in batches instead of one-by-one, which should significantly improve performance and reduce database transaction overhead. The changes update all references from runner to runner2 throughout the codebase.


🎯 Positive Changes

  1. Batch Signal Processing - The core improvement from listen_with_timeout to listen_n_with_timeout(runner_lost_threshold, 1024) is excellent for performance. Processing up to 1024 signals per iteration will dramatically reduce workflow state transitions.

  2. Consolidated Message Sending - The new SendMessagesToRunner activity (plural) batches multiple messages together instead of sending them individually, reducing pubsub overhead.

  3. Efficient Command/Event Batching - Events and commands are accumulated during signal processing and sent in a single activity call, which is much more efficient.

  4. Combined Activity InsertCommands - The new version combines inserting and acknowledging commands into a single activity, reducing state updates.


⚠️ Issues & Concerns

1. Missing Activity Removal in runner.rs

The old runner.rs still has the AckCommands activity (lines 797-806) which is no longer needed in runner2.rs. This suggests that runner.rs should eventually be deleted once runner2 is proven stable. Is there a migration plan?

Recommendation: Document the migration strategy in the PR description or add a TODO comment.

2. Potential Division by Zero

In runner2.rs:620:

let remaining_millislots = (remaining_slots * 1000) / input.total_slots;

If input.total_slots is 0, this will panic. While this is likely validated elsewhere, defensive programming suggests adding an assertion or early return.

Recommendation:

assert!(input.total_slots > 0, "total_slots must be greater than 0");
let remaining_millislots = (remaining_slots * 1000) / input.total_slots;

3. Duplicate Division by Zero Risk

Same issue exists in runner2.rs:1051:

let new_remaining_millislots = (new_remaining_slots * 1000) / old_runner_alloc_key_data.total_slots;

4. Event Batching Logic Issue

In runner2.rs:113-117:

events.extend(
    new_events
        .into_iter()
        .filter(|event| event.index > state.last_event_idx),
);

The filtering uses state.last_event_idx which is only updated AFTER processing all signals. If multiple ToServerEvents signals arrive in the same batch with overlapping indices, duplicates could slip through.

Recommendation: Track the max event index during batch accumulation:

let mut max_event_idx = state.last_event_idx;
// ... in the loop:
events.extend(
    new_events
        .into_iter()
        .filter(|event| {
            if event.index > max_event_idx {
                max_event_idx = event.index;
                true
            } else {
                false
            }
        }),
);

5. Inconsistent Error Handling for Missing Actor Workflows

The code uses graceful_not_found() and logs a warning when actor workflows aren't found (lines 166-170, 250-255, etc.). This is good, but the warning messages are inconsistent:

  • Some say "likely already stopped"
  • The pattern is repeated multiple times

Recommendation: Extract to a helper function to ensure consistent logging and reduce duplication.
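
A hedged sketch of such a helper; the context type, the Result alias, and the assumption that graceful_not_found() makes send() resolve to an Option are all illustrative:

async fn signal_actor_event(
    ctx: &WorkflowCtx, // assumed context type
    actor_id: &str,
    event: crate::workflows::actor::Event,
) -> Result<()> {
    let res = ctx
        .signal(event)
        .to_workflow::<crate::workflows::actor::Workflow>()
        .tag("actor_id", actor_id)
        .graceful_not_found()
        .send()
        .await?;
    if res.is_none() {
        // Single place for the "likely already stopped" warning
        tracing::warn!(?actor_id, "actor workflow not found, likely already stopped");
    }
    Ok(())
}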

6. Unused Imports

Line 5 imports PROTOCOL_VERSION and versioned, but PROTOCOL_VERSION is only used in the send_messages_to_runner activity. Same in runner.rs. Minor issue but affects code cleanliness.

7. State Management During Batch Processing

The batch processing accumulates ack_last_command_idx using max() (line 122), which is correct. However, if the workflow crashes mid-batch, some signals might be lost. This is probably acceptable given workflow semantics, but worth documenting.


🔒 Security Considerations

  1. JSON Deserialization with unwrap_or_default (lines 752-755, 766-770): Using unwrap_or_default() for invalid JSON is reasonable, but consider logging when invalid metadata is encountered for debugging purposes.

  2. No apparent security vulnerabilities - The changes don't introduce new attack vectors.


🚀 Performance Considerations

Major Improvements:

  • Batching signals reduces workflow state flushes by up to 1024x
  • Consolidated message sending reduces pubsub calls
  • Combined command insert/ack reduces activity invocations

Potential Concerns:

  • Processing 1024 signals in memory could increase memory usage, but this should be negligible for the signal types involved
  • The batch size of 1024 seems reasonable, but might need tuning based on production metrics

🧪 Test Coverage

Missing: The PR doesn't include tests for the new batching behavior. Critical test cases:

  1. Multiple signals of the same type in one batch
  2. Mixed signal types in one batch
  3. Duplicate event indices across multiple signals in the same batch (see issue #4 above)
  4. Batch with CommandStartActor to a draining runner
  5. Edge case: exactly 1024 signals vs 1025 signals

Recommendation: Add integration tests for runner2 before replacing runner.


📝 Code Quality & Style

  1. Good: Follows existing patterns and Rust conventions
  2. Good: Structured logging with tracing and proper use of ?x syntax
  3. Good: Comprehensive comments explaining complex logic
  4. Minor: The // events: Vec<EventRow> commented field (line 36) should probably be removed if events won't be stored

✅ Recommendations Summary

Critical:

  1. Fix potential duplicate event processing in batch accumulation (issue #4)
  2. Add assertions for division by zero (issues #2 & #3)
  3. Add test coverage for batch signal processing

Important:
4. Document migration plan from runner to runner2
5. Consider extracting repeated "actor not found" logging to a helper

Nice to have:
6. Log when invalid metadata is encountered
7. Remove commented events field if not needed
8. Clean up unused imports


📊 Overall Assessment

This is a well-designed performance optimization with significant benefits. The batch processing approach is sound and should dramatically reduce workflow overhead. However, the duplicate event filtering issue needs to be addressed before merging, and comprehensive tests are essential given the critical nature of the runner workflow.

Recommendation: Request Changes - Address the critical issues above, particularly the event batching logic and test coverage.

@claude

claude bot commented Nov 18, 2025

PR Review: Rewrite Runner Workflow to Handle Batch Signals

Summary

This PR rewrites the pegboard runner workflow (runner2.rs) to batch-process signals instead of handling them one at a time. This is a significant performance optimization that should reduce database transaction overhead and improve throughput.


✅ Strengths

1. Performance Optimization

  • Batch signal processing: The key improvement is using listen_n_with_timeout() to fetch up to 1024 signals at once (line 76-78), versus the old listen_with_timeout() that processed one signal at a time
  • Reduced activity calls: By batching multiple commands/events, the workflow makes fewer database transactions and pubsub calls
  • Message batching: Multiple messages are accumulated and sent together (lines 324-330), reducing network overhead

2. Improved State Management

  • The batch processing logic cleanly separates signal collection (lines 94-183) from processing (lines 185-330)
  • State updates are properly managed within the loop

3. Code Quality

  • Clear separation of concerns with well-named variables (init, events, commands, check_queue)
  • Proper filtering of duplicate events (lines 113-117)
  • Comments explain non-obvious behavior (e.g., line 236: "should not be parallelized")

⚠️ Issues & Concerns

Critical: Potential Signal Ordering Issue

Location: Lines 101-183

Issue: When batching signals, the order of processing different signal types may not match the order they were received. The current code processes all signals in a single loop and separates them by type into different collections:

for signal in signals {
    match signal {
        Main::Forward(sig) => { /* collect into init/events/ack_last_command_idx */ }
        Main::Command(command) => { /* collect into commands */ }
        Main::CheckQueue(_) => { check_queue = true; }
        Main::Stop(sig) => { handle_stopping(...).await?; }
    }
}

Then processes them in this order:

  1. Init (lines 187-232)
  2. Events (lines 239-256)
  3. Commands (lines 280-303)
  4. Check queue (lines 305-322)

Problem: Signals are processed by type in the fixed order above, not in arrival order. A Command signal that arrives in the batch before a Forward::ToServerInit is still buffered into the commands vec and only handled after init; likewise, commands and events may be processed out of order relative to when they arrived.

Recommendation: Consider whether signal ordering matters for correctness. If it does, you may need to:

  • Process signals in arrival order, OR
  • Document why out-of-order processing is safe, OR
  • Ensure init is always processed first if present before processing other signals

Bug: Missing Return After handle_stopping

Location: Line 125 and Line 180

Issue: When ToServerStopping or Stop signal is received, handle_stopping() is called but the code continues to process other signals in the batch. This might not be the intended behavior.

protocol::ToServer::ToServerStopping => {
    handle_stopping(ctx, &input, state, false).await?;
    // No return or break here - continues processing other signals
}

Current behavior: If stopping signal appears in the middle of a batch, subsequent signals in that batch will still be processed.

Recommendation: Consider if you should break out of the signal processing loop when stopping is initiated, or document why continued processing is safe.


Minor: Inconsistent Activity Input Naming

Location: Line 325

Issue: The activity is called SendMessagesToRunnerInput (plural "Messages"), but previously it was SendMessageToRunnerInput (singular). The new version takes Vec<protocol::ToClient> which is appropriate for batching.

Finding: This is actually correct - the new version properly uses plural naming. Good work! ✅


Code Smell: Duplicate Init Handling

Location: Lines 105-109

Issue: The code only keeps the first init signal if multiple arrive:

protocol::ToServer::ToServerInit(init_sig) => {
    if init.is_none() {
        init = Some(init_sig);
        check_queue = true;
    }
}

Questions:

  • Is it possible to receive multiple init signals in a batch?
  • Should subsequent inits be logged as warnings?
  • Could this hide bugs where runners incorrectly send multiple inits?

Recommendation: Add a warning log if a duplicate init is received.
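
For example, only the else branch is new relative to the snippet above:

protocol::ToServer::ToServerInit(init_sig) => {
    if init.is_none() {
        init = Some(init_sig);
        check_queue = true;
    } else {
        // Surfaces protocol violations instead of silently dropping them
        tracing::warn!("received duplicate init signal in batch, ignoring");
    }
}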


Performance: Sequential Event Processing

Location: Lines 239-256

Issue: Events are still processed sequentially with individual await calls for each actor workflow signal:

for event in &events {
    let actor_id = crate::utils::event_actor_id(&event.inner).to_string();
    let res = ctx
        .signal(crate::workflows::actor::Event { inner: event.inner.clone() })
        .to_workflow::<crate::workflows::actor::Workflow>()
        .tag("actor_id", &actor_id)
        .graceful_not_found()
        .send()
        .await?;
    // ...
}

Recommendation: Consider whether these signals could be sent in parallel (e.g., using futures::future::join_all) if they target different actors. The comment on line 236 says "should not be parallelized" but it's unclear if this applies to events targeting different actors or just events for the same actor.

Impact: If you have 100 events in a batch, this still makes 100 sequential network calls which may negate some of the batching benefits.


Minor: Magic Number in Event Ack Logic

Location: Lines 265-277

Issue: The batching logic for event acks is correct but could be clearer:

if last_event_idx > state.last_event_ack_idx.saturating_add(EVENT_ACK_BATCH_SIZE) {

Recommendation: Add a comment explaining why we batch event acks (to avoid sending an ack message for every single event).
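
Something like the following, where the condition is copied from the file and only the comment is new:

// Only ack once we are EVENT_ACK_BATCH_SIZE events past the last ack, so the
// runner does not receive an ack message for every single event.
if last_event_idx > state.last_event_ack_idx.saturating_add(EVENT_ACK_BATCH_SIZE) {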


🔒 Security Concerns

No major security issues identified. The changes are primarily performance-oriented and don't introduce new security vulnerabilities.


🧪 Test Coverage

Concerns:

  1. No tests visible in this PR - This is a significant rewrite of core functionality
  2. Critical scenarios to test:
    • Batch processing with mixed signal types
    • Signal ordering semantics
    • Handling stop/stopping signals mid-batch
    • Behavior when batch size limit (1024) is reached
    • Race conditions with draining state and incoming commands
    • Event deduplication logic

Recommendation: Add integration tests that verify:

  • Batches of various sizes are processed correctly
  • Commands are correctly indexed when batched
  • Events are properly deduplicated
  • Init signal handling with concurrent signals
  • Draining behavior with pending signals

📝 Additional Observations

Good Practices

  1. Proper use of graceful_not_found() for actor workflow signals
  2. Clear logging with structured fields (e.g., ?actor_id)
  3. Batching events acknowledgment to reduce message overhead
  4. Database transaction instrumentation with custom_instrument()

Deprecated Code

Line 456: evict_workflow_id: Option<Id> is marked as deprecated but still present. Consider removing if truly unused.


📊 Performance Impact

Expected improvements:

  • Reduced database transactions: Commands are batched into single activity calls
  • Fewer pubsub messages: Multiple ToClient messages sent together
  • Lower workflow execution overhead: Fewer state transitions

Potential concerns:

  • Latency: Waiting to batch signals adds latency (up to runner_lost_threshold duration)
  • Memory: Batching 1024 signals could increase memory usage, especially if events contain large payloads

✅ Final Verdict

This is a solid performance optimization with good implementation quality. The main concerns are:

  1. 🔴 Critical: Verify signal ordering semantics are correct
  2. 🟡 Important: Consider whether to break/return after stopping signals
  3. 🟡 Important: Add comprehensive tests for batch processing scenarios
  4. 🟢 Nice-to-have: Investigate parallelizing actor signals for different actors

Recommendation: Address the critical signal ordering question before merging. Consider adding tests to prevent regressions.


Great work on this optimization! The batch processing approach should significantly improve throughput for busy runners. 🚀

@claude

claude bot commented Nov 19, 2025

PR Review: Rewrite Runner Workflow to Handle Batch Signals

I've reviewed this PR and have the following feedback:

Overview

This PR rewrites the runner workflow (creating runner2.rs) to handle batch signal processing instead of processing signals one at a time. This is a significant architectural improvement that should reduce workflow overhead and improve performance.


Strengths

  1. Batch Signal Processing: The switch from listen_with_timeout to listen_n_with_timeout (line 76-78) allows processing up to 1024 signals in a single loop iteration, which should significantly reduce workflow execution overhead.

  2. Message Batching: Multiple messages are now accumulated and sent together (line 185, 324-330), reducing the number of activity calls to SendMessagesToRunner.

  3. Efficient Event Acknowledgment: Events are still batched with the same EVENT_ACK_BATCH_SIZE logic but now benefit from the overall batching architecture (lines 264-277).

  4. Proper Signal Deduplication: Command acks are properly aggregated using max() (line 122) to handle multiple ack signals in the same batch.

  5. Consistent Error Handling: The workflow maintains the same error handling patterns and graceful degradation.


Potential Issues

Critical: State Management Concern

Location: process_init activity (line 740-801)

The process_init activity reads from state.last_event_idx and state.commands. The LifecycleState tracks last_event_idx separately from the activity State. When process_init returns missed commands, it compares against State.last_command_idx which might be out of sync with what was actually processed in the loop.

Recommendation: Verify state consistency between State and LifecycleState is maintained correctly.


Medium: Sequential Event Processing

Location: Lines 239-256

Events are still processed sequentially with the comment that they should not be parallelized. However, sending 1024 events one-by-one could become a bottleneck. Consider whether strict ordering is required for all events or if some could be parallelized.


Medium: Message Sending Pattern

Location: Lines 324-330, 1132-1150

The implementation sends messages in a loop without batching at the protocol level. Consider whether the protocol supports sending multiple messages in a single publish operation, or if this activity could publish in parallel.


Test Coverage

Are there tests that verify:

  1. Batch signal processing works correctly when receiving the maximum (1024) signals?
  2. State consistency between State and LifecycleState?
  3. Command/event deduplication works correctly in batch scenarios?
  4. The workflow correctly handles mixed signal types in a single batch?

Performance Considerations

Positive Impact:

  • Significant reduction in workflow loop iterations
  • Fewer activity executions for message sending
  • Better throughput for high-signal workloads

Questions:

  1. What is the expected performance improvement? Has this been load-tested?
  2. Are there any concerns about the 1024 batch size causing timeouts in extreme cases?
  3. Should there be metrics added to track batch sizes and processing times?

Migration Strategy

Since this creates runner2.rs alongside the existing runner.rs:

  • When will the old runner workflow be deprecated?
  • Is there a rollback plan if issues are discovered?
  • Are both versions expected to run concurrently during migration?

Overall, this is a solid architectural improvement that should significantly enhance performance. The main concern is ensuring state consistency is maintained correctly with the new batching approach.

@claude

claude bot commented Nov 19, 2025

Code Review for PR #3480

I've reviewed this pull request which rewrites the runner workflow to handle batch signal processing.

Strengths & Improvements

  1. Batch Signal Processing: Switching from listen_with_timeout to listen_n_with_timeout processes up to 1024 signals in a single loop iteration - significant performance improvement.

  2. Batch Message Sending: Messages are accumulated and sent together via SendMessagesToRunnerInput, reducing activity calls.

  3. Improved Command/Event Batching: Events and commands are accumulated and processed together, reducing database overhead.

  4. Better State Management: Command acknowledgment logic is cleaner - InsertCommandsInput handles both insertions and acknowledgments in a single activity.

Potential Issues

1. Incomplete File (Critical)

The file appears truncated at line 1176, and line 944 shows an incomplete struct name. Please verify the full file was committed.

2. Backward Compatibility

SendMessageToRunnerInput (singular) replaced by SendMessagesToRunnerInput (plural) - could break workflow replays. Need migration strategy.

3. Duplicate Code

Both runner.rs and runner2.rs exist with nearly identical code - maintenance burden. Need clear deprecation plan.

4. Sequential Actor Signal Processing

Lines 239-256 send signals sequentially. Could become bottleneck under high event volume. Consider adding telemetry.

5. Missing Validation

Line 122 uses .max() for ack indices - silently handles out-of-order acks. Consider logging these cases.

Code Quality

  • Excellent adherence to logging conventions (structured logging, lowercase messages)
  • Proper error handling with custom error system
  • Code style follows project standards

Testing

Missing:

  • Tests for batch signal processing
  • Performance benchmarks
  • Migration tests for in-flight workflows

Action Items

Must Fix:

  1. Complete truncated file

Should Address:
2. Migration plan for runner.rs to runner2.rs transition
3. Integration tests for batch processing

Nice to Have:
4. Make batch size (1024) configurable
5. Add metrics for batch sizes
6. Deprecation timeline documentation

Overall

Solid performance improvement with good code quality. Main concerns: incomplete file, dual-workflow situation, missing tests. Once addressed, should be ready to merge.


Review generated by Claude Code

@MasterPtato force-pushed the 11-17-fix_gas_fix_batch_listen_fix_history_for_graceful_signal_send_in_workflows branch from aa9ef2a to dfd167c on November 20, 2025 03:07
@claude

claude bot commented Nov 20, 2025

Code Review - Batch Signal Processing Rewrite

This PR introduces runner2.rs to handle batch signal processing, replacing the single-signal approach with processing up to 1024 signals per iteration.

Strengths

  • Excellent batch processing design using listen_n_with_timeout
  • Well-structured signal accumulation with smart deduplication
  • Efficient message batching reduces pubsub overhead
  • Good error handling with graceful_not_found

Critical Issues

1. Unused Import (line 11)
use vbare::OwnedVersionedData; is not used. Remove it.

2. Potential Division by Zero (lines 620, 1057)
let remaining_millislots = (remaining_slots * 1000) / input.total_slots; will panic if total_slots is 0. Use checked_div() with error handling.

3. No Test Coverage
1176-line rewrite with zero tests. Add tests for: batch processing, timeout behavior, event deduplication, command acknowledgment, draining state, race conditions.

Important Issues

4. Sequential Event Processing (lines 239-256)
Events sent sequentially could bottleneck. If ordering is per-actor (not across actors), consider parallelizing across different actors.

5. Sequential Allocation Signals (lines 315-321)
Independent allocations could be parallelized with futures::future::try_join_all().

6. Missing Migration Documentation
No explanation of runner2 naming, migration strategy, or runner.rs deprecation timeline.

Code Quality

  • Commented code (lines 36, 819): implement or remove
  • Deprecated field (line 455): add removal timeline
  • Unnecessary clone (line 70): input.clone() every iteration
  • Use Default trait for LifecycleState::new()

Security & Performance

  • Memory growth risk: 1024 signals with large payloads
  • Info disclosure (line 132): avoid logging full messages
  • Missing metrics: batch size, latency, queue depths
  • No tracing span for profiling

Summary

Must Fix:

  1. Remove unused import
  2. Add division-by-zero safety
  3. Add comprehensive tests
  4. Document migration strategy

Should Fix:
5. Parallelize allocation signals
6. Consider per-actor event parallelization
7. Add performance metrics

Well-designed improvement with clean batch processing logic. However, lack of tests is concerning and division-by-zero is a potential bug. Great work!

@claude

claude bot commented Nov 20, 2025

Pull Request Review: Rewrite Runner Workflow to Handle Batch Signals

Overview

This PR introduces runner2.rs as a rewrite of the runner workflow with the primary goal of handling batch signals more efficiently. The implementation changes from processing signals one-at-a-time to batched processing (up to 1024 signals per iteration).

Key Improvements

Batch Processing Efficiency (lines 76-78)

  • Fetches up to 1024 signals in one go with timeout using listen_n_with_timeout
  • Significantly reduces loop iterations for high-throughput scenarios
  • Should improve performance when processing multiple events/commands

Optimized Message Sending (lines 324-329)

  • Batches multiple messages into single SendMessagesToRunnerInput call
  • Old code sent messages one at a time (runner.rs lines 98-108, 112-118, 199-207)
  • New send_messages_to_runner activity processes messages in batch context

Consolidated Command Processing (lines 280-303)

  • Commands are batched, acknowledged, and sent together
  • Old code processed commands individually (runner.rs lines 265-281)
  • Better handling of command indexing when batching

Critical Issue: Migration Path Unclear

The old runner.rs workflow is still present but all callers now use runner2. This raises important questions:

  • How will in-flight workflows using the old runner be handled?
  • Should there be a deprecation notice or migration plan?
  • Can old runner workflows complete gracefully?

Recommendation: Add comments explaining the migration strategy or plan to remove runner.rs in a follow-up PR.

Other Issues

Medium: Activity API Change (lines 832-840)

  • InsertCommandsInput signature changed from i64 to Option<i64>
  • Returns None when no commands, better semantics but breaking change
  • Ensure no rollback scenarios where old/new workflows might coexist

Medium: Init Signal Deduplication (lines 105-109)

  • Only first init signal in batch is processed, silently ignoring duplicates
  • This is correct behavior but might hide protocol violations
  • Recommendation: Add warning log for multiple init signals in one batch

Code Quality Assessment

Strengths:

  • Clear batching logic with well-structured signal aggregation loop (lines 101-183)
  • Good comments explaining non-obvious decisions (e.g., line 236 about not parallelizing event forwarding)
  • Consistent error handling maintaining same patterns as original
  • Proper state management with clean LifecycleState handling
  • Follows CLAUDE.md logging conventions with structured logging using ?actor_id format
  • Lowercase log messages following project style
  • Proper tracing instrumentation with custom_instrument

Testing Recommendations:

  1. Test with exactly 1024 signals to ensure proper batch boundaries
  2. Test draining + start actor command race condition (lines 138-172)
  3. Verify command indices are sequential when batching multiple commands
  4. Test event acknowledgment with EVENT_ACK_BATCH_SIZE = 500
  5. Test timeout behavior when fewer than 1024 signals arrive

Performance Considerations

Improvements:

  • Reduced syscalls: Batch message sending reduces pub/sub publish calls
  • Fewer loop iterations: 1024 signals in one iteration vs 1024 iterations
  • Optimized state updates: Fewer state flushes due to batched processing

Potential Concerns:

  • Latency: Batching may introduce slight latency for first signal in batch (waiting for runner_lost_threshold or 1024 signals)
  • Memory: Batching up to 1024 signals plus associated data could increase memory usage per workflow

Recommendation: Monitor memory usage and latency metrics after deployment.

Security

No security issues identified:

  • Proper use of database transactions with appropriate isolation levels
  • No new SQL injection or command injection vectors
  • Maintains existing authentication/authorization patterns

Summary

This is a well-implemented refactor that should significantly improve performance for high-throughput scenarios. The main concern is around migration strategy and ensuring backwards compatibility is properly handled.

Action Items:

  1. Required: Document the migration plan from runner to runner2
  2. Recommended: Add warning log for duplicate init signals in a batch
  3. Recommended: Add integration tests for batch signal processing
  4. Optional: Consider adding metrics to track batch sizes for observability

Approval Status: Approved with recommendations. The code quality is high and the batching approach is sound. Please address the migration path documentation before merging.

@MasterPtato force-pushed the 11-17-fix_gas_fix_batch_listen_fix_history_for_graceful_signal_send_in_workflows branch from dfd167c to bcf4d5d on November 21, 2025 19:57
@MasterPtato force-pushed the 11-17-fix_pb_rewrite_runner_wf_to_handle_batch_signals branch from 9a87fd1 to 0560d9b on November 21, 2025 19:57
@claude

claude bot commented Nov 21, 2025

PR Review: fix(pb): rewrite runner wf to handle batch signals

Summary

This PR introduces a new runner2 workflow that processes signals in batches rather than one at a time, which is a significant improvement for handling high-throughput scenarios. All consumers are updated to use the new runner2 module.

Key Changes

  1. Batch Signal Processing: The new workflow uses listen_n_with_timeout to fetch up to 1024 signals at once, instead of processing them individually with listen_with_timeout. This is a major performance improvement.

  2. Message Batching: Multiple messages to the runner are now collected and sent in a single activity call via SendMessagesToRunnerInput, reducing round trips.

  3. Combined Command Handling: InsertCommandsInput now handles both inserting new commands and acknowledging old ones in a single activity, eliminating the separate AckCommandsInput activity.


Code Quality & Best Practices

Positives:

  • Clean separation of concerns with well-defined activities
  • Good use of structured logging with tracing
  • Proper error handling with graceful_not_found() for actor workflows that may have stopped
  • Clear documentation comments explaining batching behavior

Issues:

  1. Duplicate Code (runner2.rs): The runner2.rs file duplicates ~90% of the code from runner.rs. Consider:

    • Extracting shared activities/types into a common module
    • Or if runner.rs is deprecated, mark it as such or remove it
  2. Protocol Type Mismatch (runner2.rs:1141):

    let message_serialized = versioned::ToClient::wrap_latest(message.clone())

    The new workflow uses protocol::ToClient / versioned::ToClient, while the original uses protocol::ToRunner / versioned::ToRunner. Verify this is intentional and the protocol types are compatible, or this could cause deserialization failures at runtime.

  3. Comment Discrepancy (runner2.rs:395 vs runner.rs:395):

    // Set all remaining actors as going away immediately

    The comment says "going away" which is correct, but the original runner.rs:395 says "lost immediately" which is incorrect for that code path.


Potential Bugs

  1. Event Ordering Assumption (runner2.rs:234):

    let last_event_idx = events.last().map(|event| event.index);

    This assumes events arrive in order. If signals arrive out of order and are batched, events.last() may not have the highest index. Consider using events.iter().map(|e| e.index).max() instead.

  2. State Not Updated After Loop (runner2.rs:262):

    state.last_event_idx = last_event_idx;

    This updates state.last_event_idx after the activity, but the filtering at line 113-117 uses state.last_event_idx from before any events were processed in the current batch. This should be fine since deduplication happens per-batch, but be aware of this during replays.

  3. Init Signal Handling (runner2.rs:106-109):

    if init.is_none() {
        init = Some(init_sig);
        check_queue = true;
    }

    Multiple ToServerInit signals in a batch will only process the first one. This is likely intentional, but subsequent init signals are silently ignored. Consider logging a warning if multiple init signals are received.


Performance Considerations

Positives:

  • Batch processing up to 1024 signals is a major throughput improvement
  • Message batching reduces pubsub overhead
  • Combined insert/ack for commands reduces activity count

Potential Improvements:

  1. Parallel Signal Dispatch (runner2.rs:315-321):

    for alloc in res.allocations {
        ctx.signal(alloc.signal)
            // ...
            .send()
            .await?;
    }

    Consider if these signals could be sent in parallel using futures::future::try_join_all since they are independent operations.

  2. Event Forwarding (runner2.rs:239-256): The comment notes signals must be sent in order. If strict ordering isn't required between different actors, events could be parallelized by actor_id grouping.


Security Concerns

No security issues identified. The PR maintains the same trust boundaries and doesn't introduce new attack surfaces.


Test Coverage

Missing:

  • No new tests added for the batch signal handling logic
  • Edge cases to consider testing:
    • Batch with multiple init signals
    • Out-of-order event indices within a batch
    • Mixed signal types (events + commands + stops) in single batch
    • Draining state behavior with batched start actor commands

Minor Suggestions

  1. Magic Number (runner2.rs:77): The 1024 batch size could be a constant like EVENT_ACK_BATCH_SIZE:

    const SIGNAL_BATCH_SIZE: usize = 1024;
  2. Typo in Comment (runner2.rs:373):

    // Close websocket connection (its unlikely to be open)

    Should be "it's" (contraction of "it is").


Verdict

The batch signal processing is a solid improvement for throughput. The main concerns are:

  1. The protocol type change (ToClient vs ToRunner) should be verified
  2. Consider adding tests for batch edge cases
  3. The duplicate code between runner.rs and runner2.rs should be addressed (either share code or deprecate/remove the old module)

Overall, this looks like a positive change once the above items are addressed.

@claude

claude bot commented Nov 22, 2025

PR Review: fix(pb): rewrite runner wf to handle batch signals

Summary

This PR introduces runner2.rs, a rewritten runner workflow that processes signals in batches instead of one at a time. All references to the runner workflow have been updated to point to the new runner2 module. The key change is using listen_n_with_timeout to batch up to 1024 signals at once, reducing workflow iterations and improving throughput.

✅ Code Quality & Best Practices

  1. Good batching implementation: The new workflow correctly batches multiple signal types (Forward, Command, CheckQueue, Stop) and processes them efficiently in a single loop iteration.

  2. Proper deduplication: Events are correctly filtered to ignore already-received events (event.index > state.last_event_idx), and init signals only process the first one (if init.is_none()).

  3. Message aggregation: Messages to the runner are aggregated into a single SendMessagesToRunnerInput call instead of multiple individual calls, reducing activity overhead.

  4. Consistent patterns: The new code follows existing patterns in the codebase (structured logging, error handling, activity/workflow structure).

⚠️ Potential Issues & Suggestions

  1. Duplicate code: runner.rs and runner2.rs share nearly identical code (~90% overlap). Consider:

    • Removing runner.rs entirely if it's no longer needed
    • Or adding a deprecation notice with a TODO to remove it
    • The current state leaves technical debt with two implementations to maintain
  2. Protocol type inconsistency (line 100, runner2.rs):

    messages.push(protocol::ToClient::ToClientInit(protocol::ToClientInit {

    The old runner.rs uses protocol::ToRunner::ToClientInit. Verify that protocol::ToClient and protocol::ToRunner are type aliases or that SendMessagesToRunner handles both correctly. If they differ, this could cause serialization issues.

  3. InsertCommandsInput changes: In runner2.rs, InsertCommandsInput adds an ack_last_command_idx field and returns Option<i64> instead of i64. This combines ack + insert logic which is cleaner, but note:

    • The separate AckCommandsInput activity from runner.rs is removed in runner2.rs
    • Ensure backward compatibility if workflows can be replayed
  4. Comment clarification (line 395, runner.rs):

    // Set all remaining actors to lost immediately

    vs runner2.rs (line 303):

    // Set all remaining actors as going away immediately

    The runner2 comment is more accurate since it sends GoingAway not Lost.

🔒 Security Considerations

  • No security concerns identified. The changes are internal workflow optimizations.

⚡ Performance Considerations

  1. Positive: Batch signal processing will significantly reduce workflow state transitions under high load, improving throughput.

  2. Potential concern: Processing 1024 signals in one iteration means longer individual loop iterations. If signal processing fails partway through, more signals may need reprocessing. However, this is mitigated by the workflow framework's replay semantics.

  3. Memory: Batching 1024 signals could temporarily increase memory usage. For high-volume runners, consider if this batch size is appropriate or should be configurable.

🧪 Test Coverage

  • No new tests added. Consider adding tests for:
    • Batch signal processing with mixed signal types
    • Event deduplication behavior
    • Command ack + insert combined behavior
    • Draining state with batched start actor commands

📝 Minor Nits

  1. The ack_last_command_idx check initializes to -1 and compares with != -1. Consider using Option<i64> for clearer semantics (see the sketch after this list).

  2. Unused field evict_workflow_id in InitOutput is marked as deprecated but still present.
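
A small sketch of the Option-based version; sig.last_command_idx is an illustrative field name:

// "no ack seen yet" becomes an explicit None instead of a -1 sentinel
let mut ack_last_command_idx: Option<i64> = None;

// While draining the batch (None compares less than Some, so max works):
ack_last_command_idx = ack_last_command_idx.max(Some(sig.last_command_idx));

// After the batch:
if let Some(idx) = ack_last_command_idx {
    // forward the acknowledgment using idx
}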

Conclusion

The PR accomplishes its goal of batch signal processing efficiently. The main concern is the duplicate code with runner.rs. If both modules will coexist, consider extracting shared logic. If runner.rs is deprecated, document this clearly.

Recommendation: Approve with minor suggestions. Verify the protocol::ToClient vs protocol::ToRunner type usage is correct before merging.

@MasterPtato force-pushed the 11-17-fix_pb_rewrite_runner_wf_to_handle_batch_signals branch from ff15327 to f0e981f on November 25, 2025 02:58
@MasterPtato force-pushed the 11-17-fix_gas_fix_batch_listen_fix_history_for_graceful_signal_send_in_workflows branch from 457234c to ea2cfc5 on November 25, 2025 02:58
@claude

claude bot commented Nov 25, 2025

Pull Request Review: Rewrite Runner Workflow to Handle Batch Signals

This PR rewrites the runner workflow to handle batch signal processing instead of processing signals one at a time - a significant architectural improvement.

Key Strengths

  1. Performance Improvement: Uses listen_n_with_timeout to process up to 1024 signals per iteration instead of one at a time
  2. Message Batching: Collects and sends messages in batches, reducing activity overhead
  3. Cleaner Logic: Commands and events accumulated and processed together
  4. Deduplication: Event deduplication correctly filters using state.last_event_idx

Critical Issues

1. Missing Command Acknowledgment Activity
The original AckCommands activity (lines 797-806 in runner.rs) is removed. While ack logic exists in InsertCommandsInput, the combined activity returns Option<i64> instead of i64. Verify this migration is complete and correct.

2. Protocol Type Change
Original uses protocol::ToRunner, new version uses protocol::ToClient. Confirm this is an intentional rename.

3. Init Signal Race
Multiple ToServerInit signals in same batch are silently ignored after first. Consider logging a warning.

4. State Growth
State.commands vector can grow unbounded with faster batch processing. Add limits or warnings.

5. Failure Recovery
Batched commands may have different retry semantics if InsertCommandsInput fails partway through.

Testing Recommendations

  1. Load test with 1000+ signals per batch
  2. Test multiple init signals
  3. Verify command ack cleanup
  4. Test timeout behavior
  5. Verify event/command ordering across batches
  6. Test activity failure recovery

Questions

  1. Why combine insert and ack into one activity?
  2. Have you load-tested the batch processing?
  3. Expected max size for commands vector?

Recommendation

Approve with minor revisions - Verify command ack logic, add duplicate init warning, add bounds checking, ensure test coverage.

Nitpicks

  • Line 209: Use field init shorthand for runner_lost_threshold
  • Extract 1024 to SIGNAL_BATCH_SIZE constant for consistency

@claude

claude bot commented Nov 25, 2025

Code Review: Runner Workflow Rewrite for Batch Signal Handling

This PR introduces runner2.rs to handle batch signals efficiently by processing multiple signals per loop iteration instead of one-at-a-time.

Positive Aspects

Performance Improvement - Using listen_n_with_timeout to fetch up to 1024 signals at once (line 77) is excellent optimization.

Efficient Message Batching - Messages accumulated and sent together reduces network overhead (lines 185, 324-330).

Signal Deduplication - Good handling: only first Init processed (line 106), max of AckCommands taken (line 122).

Code Structure - Well-organized separation between batching, processing, and dispatch.

Issues and Concerns

1. Unused Imports - Lines 5, 7, 10, 11 have unused imports. OwnedVersionedData appears unused. Run cargo clippy.

2. Type Inconsistency - Old workflow uses protocol::ToRunner (line 100) but runner2 uses protocol::ToClient (line 205). Verify equivalence.

3. Events Storage - Lines 36, 819-825 show intentionally disabled event storage. Should there be bounded storage for debugging?

4. TODO in Production - Line 667 TODO about combining txn. Create tracking issue if deferred.

5. Integer Overflow - Line 865 casts usize to i64. Use checked arithmetic or document safety.

Security and Testing

Security - Batch size 1024 prevents unbounded growth. Event filtering (line 116) prevents replays.

Testing - No tests for batching behavior. Need unit tests for deduplication, integration tests for ordering, edge case tests.

Migration Path

Both runner.rs and runner2.rs exist. When will old be removed? Any compatibility concerns for in-flight workflows?

Summary

Ratings: Code Quality 4/5, Performance 5/5, Security 5/5, Test Coverage 2/5

Must Fix: Remove unused imports, verify type consistency, add test coverage

Should Address: Document migration plan, create issue for TODO, use checked arithmetic

Well-designed performance improvement. Main concerns are completeness not correctness. Safe to merge with must-fix items addressed.
