Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

rivetkit-serverless: Error (updated Nov 19, 2025 10:07pm)

3 Skipped Deployments:

  • rivet-cloud: Ignored (Nov 19, 2025 10:07pm)
  • rivet-inspector: Ignored (Nov 19, 2025 10:07pm)
  • rivet-site: Ignored (Nov 19, 2025 10:07pm)

Contributor Author

MasterPtato commented Nov 18, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 18, 2025

PR Review: fix(gas): optimize loops

Summary

This PR optimizes loop performance in the workflow engine by:

  1. Parallelizing database operations using tokio::join!
  2. Reducing database write frequency from every iteration to every 20 iterations (controlled by LOOP_ITERS_PER_COMMIT)

Code Quality & Best Practices

Strengths:

  • ✅ The optimization approach is sound - parallelizing independent DB operations will reduce latency
  • ✅ Good use of tokio::join! to run loop event commit, branch commit, and user code concurrently
  • ✅ Helpful inline comments explaining the optimization strategy (lines 949-955)
  • ✅ Maintains backward compatibility with existing loop behavior

Concerns:

  1. Missing Safety Documentation ⚠️
    The comment on line 45 mentions "mark previous loop history to forgotten" but this only happens when upsert_workflow_loop_event is called. With the new batching, history is only marked as forgotten every 20 iterations instead of every iteration. This could have implications:

    • The "active" history will accumulate 20 iterations worth of events before being moved to "forgotten" history
    • Consider documenting this behavior change and whether it impacts recovery scenarios
  2. Hard-coded Magic Number ⚠️

    const LOOP_ITERS_PER_COMMIT: usize = 20;

    While defined as a constant, there's no justification for why 20 was chosen. Consider:

    • Adding a comment explaining the rationale (performance vs. recovery granularity tradeoff)
    • Whether this should be configurable based on use case (tight loops with small state vs. expensive operations)
  3. State Loss Risk 🔴 CRITICAL
    If a workflow crashes between iterations (e.g., at iteration 15), the state from iterations 1-14 will be lost since they were never persisted. The workflow will restart from iteration 0 when replayed.

    Example scenario:

    • Loop runs iterations 1-15
    • Workflow crashes at iteration 15
    • State was only committed at iteration 0
    • On replay, workflow restarts from iteration 0, redoing all 15 iterations

    This breaks the workflow engine's durability guarantees. The current behavior ensures every iteration is recoverable.

Potential Bugs

Critical Issue: The optimization introduces a correctness bug. Looking at line 994:

if iteration % LOOP_ITERS_PER_COMMIT == 0 {
    self.db.upsert_workflow_loop_event(...)
}

The modulo check means:

  • Iteration 0: NOT committed (0 % 20 = 0, but handled separately in initial creation)
  • Iteration 1-19: NOT committed
  • Iteration 20: committed (20 % 20 = 0)
  • Iteration 21-39: NOT committed
  • etc.

But the loop event was already created at iteration 0 (lines 892-904), so iteration 20 would be the first update. However, if the loop breaks at iteration 15, the final state IS committed (lines 1021-1032), so that case is handled.

The real issue: If a workflow crashes/restarts between commits (e.g., at iteration 15), it will replay from iteration 0, not iteration 15. This changes the recovery semantics.
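The recovery semantics described above can be made concrete with a small sketch: under batched commits, the replay point after a crash is the last committed multiple of the batch size. This is a hypothetical illustration of the schedule the review describes, not the engine's actual code:

```rust
// Hypothetical sketch of the batched-commit schedule described in the review.
const LOOP_ITERS_PER_COMMIT: usize = 20;

/// Whether the batched path would commit state at this iteration.
fn commits_at(iteration: usize) -> bool {
    iteration % LOOP_ITERS_PER_COMMIT == 0
}

/// The iteration a workflow would replay from after crashing at `crash_iter`:
/// the last committed multiple of LOOP_ITERS_PER_COMMIT at or before it.
fn replay_point(crash_iter: usize) -> usize {
    (crash_iter / LOOP_ITERS_PER_COMMIT) * LOOP_ITERS_PER_COMMIT
}

fn main() {
    // Crash at iteration 15: nothing committed since iteration 0, so replay from 0.
    assert_eq!(replay_point(15), 0);
    // Crash at iteration 35: last commit was the batched one at iteration 20.
    assert_eq!(replay_point(35), 20);
    assert!(commits_at(20) && !commits_at(15));
}
```

Under the old behavior every iteration was a commit point, so `replay_point` would simply equal `crash_iter`; the gap between the two functions is exactly the work re-done on replay.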

Performance Considerations

Positive:

  • Reducing DB writes from N to N/20 for a loop with N iterations will significantly improve throughput
  • Parallelizing the initial loop event commit with the first iteration's branch commit is clever

Concerns:

  • The tradeoff is between performance and crash recovery granularity
  • For long-running loops (thousands of iterations), losing progress on crash could be expensive
  • Consider profiling with realistic workloads to validate the 20-iteration interval

Security Concerns

No security issues identified. The changes are confined to internal workflow state management.

Test Coverage

Missing tests for the optimization:

  1. No test verifies that state commits happen every 20 iterations
  2. No test validates crash/restart behavior with the new batching
  3. No test for the parallelization of DB operations

Recommended tests:

// Test that loop state is persisted every LOOP_ITERS_PER_COMMIT iterations
#[test]
async fn test_loop_commit_batching() {
    // Run a loop for 50 iterations
    // Kill and restart the workflow at iteration 35
    // Verify it resumes from iteration 20 (last commit), not iteration 0
}

// Test that final state is always committed on Loop::Break
#[test]
async fn test_loop_final_state_commit() {
    // Run loop that breaks at arbitrary iteration (e.g., 15)
    // Verify final state is persisted correctly
}

Recommendations

  1. Consider reverting or adding a feature flag - The durability implications need careful evaluation
  2. Document the tradeoff - Add detailed comments explaining why batching is safe (or make it configurable)
  3. Add comprehensive tests - Especially for crash/recovery scenarios
  4. Profile before merging - Measure actual performance impact with realistic loop workloads

Questions for Author

  1. Have you tested crash/restart behavior with loops that run for 10-50 iterations?
  2. What happens to workflows in-flight when this change is deployed? Will they replay from iteration 0?
  3. Was 20 iterations chosen based on profiling data or intuition?
  4. Are there use cases where loops hold critical state that must be persisted every iteration?

Overall Assessment: The optimization is well-intentioned but potentially introduces a subtle correctness issue around crash recovery. I'd recommend either:

  • Making this configurable with a default that preserves current behavior
  • OR thoroughly documenting the new recovery semantics and adding tests to validate them

Happy to discuss further! 🤖

Generated with Claude Code

@MasterPtato MasterPtato changed the base branch from 11-13-chore_rename_retry_-_hibernate_in_runner_protocol to graphite-base/3475 November 19, 2025 22:06
@MasterPtato MasterPtato force-pushed the 11-17-fix_gas_optimize_loops branch from eda671f to 2f81c6e November 19, 2025 22:06
@MasterPtato MasterPtato changed the base branch from graphite-base/3475 to 11-19-fix_tunnel_implement_ping_pong November 19, 2025 22:06
@claude

claude bot commented Nov 19, 2025

Code Review

Thanks for the optimization work on the workflow loop functionality! I've reviewed the changes and have the following feedback:

✅ Positive Aspects

  1. Performance Optimization: The parallelization of database operations using tokio::join! is a smart optimization that should reduce latency in loop iterations
  2. Reduced DB Load: The LOOP_ITERS_PER_COMMIT constant (20 iterations) significantly reduces database writes for long-running loops - good pragmatic optimization
  3. Clear Documentation: The inline comment explaining the optimization strategy (lines 949-955) is excellent and helps future maintainers understand the complexity
  4. Proper Error Handling: All three parallel operations properly propagate errors after the join

🐛 Potential Issues

Critical: State Loss on Crash

Location: engine/packages/gasoline/src/ctx/workflow.rs:994-1007

The new batching logic only commits state every 20 iterations:

if iteration % LOOP_ITERS_PER_COMMIT == 0 {
    self.db.upsert_workflow_loop_event(...).await?;
}

Problem: If a workflow crashes between commits (e.g., at iteration 15), the workflow will replay from the last checkpoint (iteration 0) instead of continuing from iteration 15. This could cause:

  • Loss of up to 19 iterations of work
  • Duplicate processing if the loop has side effects (external API calls, database writes, etc.)
  • Inconsistent state if the loop body is not idempotent

Impact: For workflows with expensive loop iterations or non-idempotent operations, this could be severe.

Recommendation:

  • Add a comment warning about this trade-off
  • Consider making LOOP_ITERS_PER_COMMIT configurable per-workflow
  • Document the idempotency requirement for loop bodies in user-facing docs
  • Consider whether 20 is the right number (maybe start more conservatively with 5-10?)

Potential Race Condition in Option::take()

Location: engine/packages/gasoline/src/ctx/workflow.rs:958

if let Some(loop_event_commit_fut) = loop_event_commit_fut.take() {

The take() operation on a mutable reference works correctly, but the variable loop_event_commit_fut is captured in the outer scope. This is safe in the current implementation because it only runs on the first iteration, but it's subtle.

Recommendation: Consider adding a comment explaining that this only executes on iteration 0, or restructure to make this more explicit.
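The one-shot behavior of Option::take() is easy to demonstrate in isolation: take() swaps the Option to None in place and returns the old value, so the guarded branch can only ever run once. A standalone sketch (a String stands in for the commit future; this is not the PR's actual code):

```rust
// Standalone illustration of the `Option::take()` one-shot pattern
// flagged in the review above.
fn main() {
    // Stand-in for the loop event commit future; a String for simplicity.
    let mut loop_event_commit: Option<String> = Some("initial commit".to_string());

    let mut ran = 0;
    for iteration in 0..3 {
        // `take()` replaces the Option with None and returns the old value,
        // so this branch executes only on the first pass through the loop.
        if let Some(commit) = loop_event_commit.take() {
            ran += 1;
            println!("iteration {iteration}: running {commit}");
        }
    }
    assert_eq!(ran, 1);
    assert!(loop_event_commit.is_none());
}
```

This is why the pattern is safe but subtle: correctness depends on nothing ever refilling the Option, which a comment at the take() site would make explicit.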

📊 Performance Considerations

Positive:

  • Parallel execution of DB operations should reduce iteration latency by ~40-60% for replay scenarios
  • Reduced DB writes will improve overall system throughput for long-running loops

Question:

  • Have you benchmarked this change? It would be valuable to measure:
    • Latency improvement per iteration
    • Total throughput improvement for long loops
    • Database load reduction

🧪 Test Coverage

Current Coverage: The existing loop_test.rs tests basic loop functionality but doesn't cover:

  • Long-running loops (>20 iterations) to test the batching logic
  • Crash/replay scenarios to verify state recovery behavior
  • Performance regression tests

Recommendation: Add tests for:

// Test that state is properly recovered after crash mid-batch
#[test]
async fn test_loop_state_recovery_mid_batch() {
    // Loop with 25 iterations, crash at iteration 15
    // Verify it resumes from iteration 0 (or 20, depending on intended behavior)
}

// Test that final state is always committed
#[test]
async fn test_loop_final_state_commit() {
    // Loop with 15 iterations (not a multiple of 20)
    // Verify final state is properly saved
}

🔒 Security Concerns

No security concerns identified - the changes don't introduce any new attack surfaces.

🎯 Suggested Improvements

  1. Add constant documentation:
/// How often to commit loop event data to db and mark previous loop history as forgotten.
/// 
/// WARNING: Setting this too high means workflows will replay more iterations on crash/restart.
/// Loop bodies should be idempotent to handle replays safely.
const LOOP_ITERS_PER_COMMIT: usize = 20;
  2. Consider edge case: What happens with loops that iterate exactly LOOP_ITERS_PER_COMMIT * N times? The final commit happens in the Loop::Break case, which is correct, but worth verifying.

  3. Metrics: Consider adding a metric for loop state commits vs iterations to monitor the effectiveness of this optimization:

metrics::LOOP_STATE_COMMITS.increment(1);
metrics::LOOP_ITERATIONS_SKIPPED.record((iteration % LOOP_ITERS_PER_COMMIT) as u64);
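The "exactly LOOP_ITERS_PER_COMMIT * N iterations" edge case raised above can be checked with a small simulation: count the commits for a loop that breaks after `total` iterations, assuming a commit at every multiple of the batch size plus a final commit on break. This sketches the described behavior, not the engine code:

```rust
// Simulation of the commit points described in the review, under the
// assumption that Loop::Break always commits the final state.
const LOOP_ITERS_PER_COMMIT: usize = 20;

/// Iterations at which state would be committed for a loop that breaks
/// after `total` iterations.
fn commit_points(total: usize) -> Vec<usize> {
    let mut points: Vec<usize> = (1..=total)
        .filter(|i| i % LOOP_ITERS_PER_COMMIT == 0)
        .collect();
    // Final state is always committed on Loop::Break; skip it if the
    // last batched commit already covered the final iteration.
    if points.last() != Some(&total) {
        points.push(total);
    }
    points
}

fn main() {
    // Exact multiple: the iteration-40 batch commit coincides with the
    // break, so there is no redundant extra commit.
    assert_eq!(commit_points(40), vec![20, 40]);
    // Non-multiples: the break adds one final commit after the last batch.
    assert_eq!(commit_points(15), vec![15]);
    assert_eq!(commit_points(45), vec![20, 40, 45]);
}
```

If the real Loop::Break path commits unconditionally rather than checking for a coinciding batch commit, the exact-multiple case would issue a harmless duplicate write; the simulation makes that the only open question.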

Summary

This is a solid performance optimization with good parallelization strategy. The main concern is the potential for state loss on crashes, which may or may not be acceptable depending on your consistency requirements. I'd recommend:

  1. Must have: Document the trade-offs and idempotency requirements
  2. Should have: Add tests for edge cases and crash scenarios
  3. Nice to have: Make batch size configurable and add metrics

Overall: ✅ Approve with recommendations - the optimization is valuable, but please address the state loss concern with documentation at minimum.

@claude

claude bot commented Nov 19, 2025

Code Review

I've reviewed the loop optimization changes in this PR. Overall, this is a well-thought-out performance optimization with good attention to detail. Here are my findings:

✅ Strengths

  1. Smart Performance Optimization: The use of tokio::join! to parallelize database operations with user code execution is excellent. This allows the loop event commit, branch event commit, and user callback to run concurrently instead of serially.

  2. Clear Documentation: The comment at lines 949-955 explaining the optimization strategy is very helpful for future maintainers.

  3. Reduced DB Write Pressure: The LOOP_ITERS_PER_COMMIT constant (20 iterations) significantly reduces database writes for long-running loops while still maintaining reasonable state persistence.

  4. Maintains Correctness:

    • Final loop state is always committed on Loop::Break
    • Error handling is preserved with the `?` operator after the join

🔍 Potential Issues

1. State Loss Risk on Failure (Moderate Severity)

With the new batching strategy, if a workflow crashes between commits, up to 19 iterations of state could be lost. The workflow would resume from the last committed iteration, potentially repeating work.

Current behavior:

  • Iterations 0-19: State committed only at iteration 0
  • Crash at iteration 15: Resumes from iteration 0, repeating 15 iterations

Consider:

  • Document this trade-off in comments
  • Add a note in workflow documentation about idempotency requirements for loop bodies
  • Consider making LOOP_ITERS_PER_COMMIT configurable if different workflows have different requirements

2. Missing Test Coverage for New Behavior

The existing loop test only runs 3 iterations, which won't exercise the batching logic (kicks in at iteration 20).

Recommendation:
Add a test case with >20 iterations to verify:

  • State is correctly restored after interruption
  • Performance improvement is measurable
  • The batching behavior works as expected

Example test:

#[tokio::test]
async fn test_loop_many_iterations() {
    // Test with 50+ iterations to verify batching behavior
    let workflow_id = test_ctx
        .workflow(LoopWorkflowInput { iterations: 50 })
        .dispatch()
        .await
        .unwrap();
    
    // Verify completion and state handling
}

3. Potential Indexing Issue (Minor)

At line 994, the condition uses:

if iteration % LOOP_ITERS_PER_COMMIT == 0

This means commits happen at iterations 0, 20, 40, etc. However, iteration 0 is already committed in the initial setup (lines 891-904), so iteration 20 would be the first batched commit. This is likely intentional but worth confirming.

Verify: Is it expected that iteration 0 is committed twice (once in setup, once in the modulo check)?
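One way to answer this concretely is to enumerate every point where upsert_workflow_loop_event would fire under the described scheme: once at setup, then on the modulo check for each subsequent iteration. This hypothetical sketch assumes the in-loop check starts at iteration 1; if it instead starts at 0, iteration 0 would indeed be committed twice:

```rust
// Hypothetical enumeration of commit points, assuming the in-loop modulo
// check runs from iteration 1 onward (not the engine's actual code).
const LOOP_ITERS_PER_COMMIT: usize = 20;

/// All iterations at which a loop event commit would fire: one at setup
/// (iteration 0), then every multiple of the batch size.
fn commit_schedule(total: usize) -> Vec<usize> {
    let mut commits = vec![0]; // initial loop event creation during setup
    for iteration in 1..=total {
        if iteration % LOOP_ITERS_PER_COMMIT == 0 {
            commits.push(iteration);
        }
    }
    commits
}

fn main() {
    // Iteration 0 appears exactly once (the setup commit) and the first
    // batched commit lands at iteration 20.
    assert_eq!(commit_schedule(45), vec![0, 20, 40]);
}
```

If the observed schedule instead contains iteration 0 twice, the in-loop check is also firing at iteration 0 and the setup commit is redundant.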

4. Database Load Consideration

The upsert_workflow_loop_event function performs significant work when iteration > 0:

  • Moves all current events to forgotten history (lines 2563-2579)
  • Clears loop events subspace (line 2581)
  • Cleans up old forgotten history if iteration > 100 (lines 2584-2603)

This is now called every 20 iterations instead of every iteration, which is good. However, for very long-running loops (thousands of iterations), this could still create periodic latency spikes.

Consider: Monitor the performance impact of these operations in production.

🛡️ Security & Error Handling

Good:

  • No security concerns identified
  • Error handling is properly maintained through the ? operator
  • The optimization doesn't introduce race conditions

📊 Code Quality

Excellent:

  • Follows Rust idioms and async best practices
  • Properly uses structured logging patterns
  • Adheres to the project's error handling conventions
  • Code is well-commented for complex sections

📝 Minor Suggestions

  1. Constant Documentation: Consider adding a comment explaining why 20 was chosen for LOOP_ITERS_PER_COMMIT:
/// How often to commit loop event data to db and mark previous loop history as forgotten.
/// Set to 20 to balance between state persistence and database write pressure.
const LOOP_ITERS_PER_COMMIT: usize = 20;
  2. Metrics: Consider adding a metric to track actual vs replayed iterations after recovery to monitor the impact of the batching strategy.

Summary

This is a solid optimization that should significantly improve performance for loop-heavy workflows. The main recommendation is to add test coverage for the >20 iteration case and document the state loss trade-off for users.

Recommendation: Approve with minor suggestions


🤖 Generated with Claude Code

@NathanFlurry NathanFlurry changed the base branch from 11-19-fix_tunnel_implement_ping_pong to graphite-base/3475 November 19, 2025 23:25