Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

rivetkit-serverless: Error (updated Nov 19, 2025 10:07pm)

3 Skipped Deployments:

  • rivet-cloud: Ignored (Nov 19, 2025 10:07pm)
  • rivet-inspector: Ignored (Nov 19, 2025 10:07pm)
  • rivet-site: Ignored (Nov 19, 2025 10:07pm)

Contributor Author

MasterPtato commented Nov 18, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 18, 2025

PR Review: fix(gas): optimize loops

Summary

This PR optimizes loop performance in the workflow engine by:

  1. Parallelizing database operations using tokio::join!
  2. Reducing database write frequency from every iteration to every 20 iterations (controlled by LOOP_ITERS_PER_COMMIT)

Code Quality & Best Practices

Strengths:

  • ✅ The optimization approach is sound - parallelizing independent DB operations will reduce latency
  • ✅ Good use of tokio::join! to run loop event commit, branch commit, and user code concurrently
  • ✅ Helpful inline comments explaining the optimization strategy (lines 949-955)
  • ✅ Maintains backward compatibility with existing loop behavior

Concerns:

  1. Missing Safety Documentation ⚠️
    The comment on line 45 mentions "mark previous loop history to forgotten" but this only happens when upsert_workflow_loop_event is called. With the new batching, history is only marked as forgotten every 20 iterations instead of every iteration. This could have implications:

    • The "active" history will accumulate 20 iterations worth of events before being moved to "forgotten" history
    • Consider documenting this behavior change and whether it impacts recovery scenarios
  2. Hard-coded Magic Number ⚠️

    const LOOP_ITERS_PER_COMMIT: usize = 20;

    While defined as a constant, there's no justification for why 20 was chosen. Consider:

    • Adding a comment explaining the rationale (performance vs. recovery granularity tradeoff)
    • Whether this should be configurable based on use case (tight loops with small state vs. expensive operations)
  3. State Loss Risk 🔴 CRITICAL
    If a workflow crashes between iterations (e.g., at iteration 15), the state from iterations 1-14 will be lost since they were never persisted. The workflow will restart from iteration 0 when replayed.

    Example scenario:

    • Loop runs iterations 1-15
    • Workflow crashes at iteration 15
    • State was only committed at iteration 0
    • On replay, workflow restarts from iteration 0, redoing all 15 iterations

    This breaks the workflow engine's durability guarantees. The current behavior ensures every iteration is recoverable.

Potential Bugs

Critical Issue: The optimization introduces a correctness bug. Looking at line 994:

if iteration % LOOP_ITERS_PER_COMMIT == 0 {
    self.db.upsert_workflow_loop_event(...)
}

The modulo check means:

  • Iteration 0: NOT committed (0 % 20 = 0, but handled separately in initial creation)
  • Iteration 1-19: NOT committed
  • Iteration 20: committed (20 % 20 = 0)
  • Iteration 21-39: NOT committed
  • etc.

But the loop event was already created at iteration 0 (lines 892-904), so iteration 20 would be the first update. However, if the loop breaks at iteration 15, the final state IS committed (lines 1021-1032), so that case is handled.

The real issue: If a workflow crashes/restarts between commits (e.g., at iteration 15), it will replay from iteration 0, not iteration 15. This changes the recovery semantics.
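The recovery semantics described above can be made concrete with a small sketch: under batched commits, the replay point after a crash is the last committed multiple of the batch size. This is a hypothetical illustration of the schedule the review describes, not the engine's actual code:

```rust
// Hypothetical sketch of the batched-commit schedule described in the review.
const LOOP_ITERS_PER_COMMIT: usize = 20;

/// Whether the batched path would commit state at this iteration.
fn commits_at(iteration: usize) -> bool {
    iteration % LOOP_ITERS_PER_COMMIT == 0
}

/// The iteration a workflow would replay from after crashing at `crash_iter`:
/// the last committed multiple of LOOP_ITERS_PER_COMMIT at or before it.
fn replay_point(crash_iter: usize) -> usize {
    (crash_iter / LOOP_ITERS_PER_COMMIT) * LOOP_ITERS_PER_COMMIT
}

fn main() {
    // Crash at iteration 15: nothing committed since iteration 0, so replay from 0.
    assert_eq!(replay_point(15), 0);
    // Crash at iteration 35: last commit was the batched one at iteration 20.
    assert_eq!(replay_point(35), 20);
    assert!(commits_at(20) && !commits_at(15));
}
```

Under the old behavior every iteration was a commit point, so `replay_point` would simply equal `crash_iter`; the gap between the two functions is exactly the work re-done on replay.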

Performance Considerations

Positive:

  • Reducing DB writes from N to N/20 for a loop with N iterations will significantly improve throughput
  • Parallelizing the initial loop event commit with the first iteration's branch commit is clever

Concerns:

  • The tradeoff is between performance and crash recovery granularity
  • For long-running loops (thousands of iterations), losing progress on crash could be expensive
  • Consider profiling with realistic workloads to validate the 20-iteration interval

Security Concerns

No security issues identified. The changes are confined to internal workflow state management.

Test Coverage

Missing tests for the optimization:

  1. No test verifies that state commits happen every 20 iterations
  2. No test validates crash/restart behavior with the new batching
  3. No test for the parallelization of DB operations

Recommended tests:

// Test that loop state is persisted every LOOP_ITERS_PER_COMMIT iterations
#[test]
async fn test_loop_commit_batching() {
    // Run a loop for 50 iterations
    // Kill and restart the workflow at iteration 35
    // Verify it resumes from iteration 20 (last commit), not iteration 0
}

// Test that final state is always committed on Loop::Break
#[test]
async fn test_loop_final_state_commit() {
    // Run loop that breaks at arbitrary iteration (e.g., 15)
    // Verify final state is persisted correctly
}

Recommendations

  1. Consider reverting or adding a feature flag - The durability implications need careful evaluation
  2. Document the tradeoff - Add detailed comments explaining why batching is safe (or make it configurable)
  3. Add comprehensive tests - Especially for crash/recovery scenarios
  4. Profile before merging - Measure actual performance impact with realistic loop workloads

Questions for Author

  1. Have you tested crash/restart behavior with loops that run for 10-50 iterations?
  2. What happens to workflows in-flight when this change is deployed? Will they replay from iteration 0?
  3. Was 20 iterations chosen based on profiling data or intuition?
  4. Are there use cases where loops hold critical state that must be persisted every iteration?

Overall Assessment: The optimization is well-intentioned but potentially introduces a subtle correctness issue around crash recovery. I'd recommend either:

  • Making this configurable with a default that preserves current behavior
  • OR thoroughly documenting the new recovery semantics and adding tests to validate them

Happy to discuss further! 🤖

Generated with Claude Code

@MasterPtato MasterPtato changed the base branch from 11-13-chore_rename_retry_-_hibernate_in_runner_protocol to graphite-base/3475 November 19, 2025 22:06
@MasterPtato MasterPtato force-pushed the 11-17-fix_gas_optimize_loops branch from eda671f to 2f81c6e November 19, 2025 22:06
@MasterPtato MasterPtato changed the base branch from graphite-base/3475 to 11-19-fix_tunnel_implement_ping_pong November 19, 2025 22:06
@claude

claude bot commented Nov 19, 2025

Code Review

Thanks for the optimization work on the workflow loop functionality! I've reviewed the changes and have the following feedback:

✅ Positive Aspects

  1. Performance Optimization: The parallelization of database operations using tokio::join! is a smart optimization that should reduce latency in loop iterations
  2. Reduced DB Load: The LOOP_ITERS_PER_COMMIT constant (20 iterations) significantly reduces database writes for long-running loops - good pragmatic optimization
  3. Clear Documentation: The inline comment explaining the optimization strategy (lines 949-955) is excellent and helps future maintainers understand the complexity
  4. Proper Error Handling: All three parallel operations properly propagate errors after the join

🐛 Potential Issues

Critical: State Loss on Crash

Location: engine/packages/gasoline/src/ctx/workflow.rs:994-1007

The new batching logic only commits state every 20 iterations:

if iteration % LOOP_ITERS_PER_COMMIT == 0 {
    self.db.upsert_workflow_loop_event(...).await?;
}

Problem: If a workflow crashes between commits (e.g., at iteration 15), the workflow will replay from the last checkpoint (iteration 0) instead of continuing from iteration 15. This could cause:

  • Loss of up to 19 iterations of work
  • Duplicate processing if the loop has side effects (external API calls, database writes, etc.)
  • Inconsistent state if the loop body is not idempotent

Impact: For workflows with expensive loop iterations or non-idempotent operations, this could be severe.

Recommendation:

  • Add a comment warning about this trade-off
  • Consider making LOOP_ITERS_PER_COMMIT configurable per-workflow
  • Document the idempotency requirement for loop bodies in user-facing docs
  • Consider whether 20 is the right number (maybe start more conservatively with 5-10?)

Potential Race Condition in Option::take()

Location: engine/packages/gasoline/src/ctx/workflow.rs:958

if let Some(loop_event_commit_fut) = loop_event_commit_fut.take() {

The take() operation on a mutable reference works correctly, but the variable loop_event_commit_fut is captured in the outer scope. This is safe in the current implementation because it only runs on the first iteration, but it's subtle.

Recommendation: Consider adding a comment explaining that this only executes on iteration 0, or restructure to make this more explicit.
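The one-shot behavior of Option::take() is easy to demonstrate in isolation: take() swaps the Option to None in place and returns the old value, so the guarded branch can only ever run once. A standalone sketch (a String stands in for the commit future; this is not the PR's actual code):

```rust
// Standalone illustration of the `Option::take()` one-shot pattern
// flagged in the review above.
fn main() {
    // Stand-in for the loop event commit future; a String for simplicity.
    let mut loop_event_commit: Option<String> = Some("initial commit".to_string());

    let mut ran = 0;
    for iteration in 0..3 {
        // `take()` replaces the Option with None and returns the old value,
        // so this branch executes only on the first pass through the loop.
        if let Some(commit) = loop_event_commit.take() {
            ran += 1;
            println!("iteration {iteration}: running {commit}");
        }
    }
    assert_eq!(ran, 1);
    assert!(loop_event_commit.is_none());
}
```

This is why the pattern is safe but subtle: correctness depends on nothing ever refilling the Option, which a comment at the take() site would make explicit.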

📊 Performance Considerations

Positive:

  • Parallel execution of DB operations should reduce iteration latency by ~40-60% for replay scenarios
  • Reduced DB writes will improve overall system throughput for long-running loops

Question:

  • Have you benchmarked this change? It would be valuable to measure:
    • Latency improvement per iteration
    • Total throughput improvement for long loops
    • Database load reduction

🧪 Test Coverage

Current Coverage: The existing loop_test.rs tests basic loop functionality but doesn't cover:

  • Long-running loops (>20 iterations) to test the batching logic
  • Crash/replay scenarios to verify state recovery behavior
  • Performance regression tests

Recommendation: Add tests for:

// Test that state is properly recovered after crash mid-batch
#[test]
async fn test_loop_state_recovery_mid_batch() {
    // Loop with 25 iterations, crash at iteration 15
    // Verify it resumes from iteration 0 (or 20, depending on intended behavior)
}

// Test that final state is always committed
#[test]
async fn test_loop_final_state_commit() {
    // Loop with 15 iterations (not a multiple of 20)
    // Verify final state is properly saved
}

🔒 Security Concerns

No security concerns identified - the changes don't introduce any new attack surfaces.

🎯 Suggested Improvements

  1. Add constant documentation:
/// How often to commit loop event data to db and mark previous loop history as forgotten.
/// 
/// WARNING: Setting this too high means workflows will replay more iterations on crash/restart.
/// Loop bodies should be idempotent to handle replays safely.
const LOOP_ITERS_PER_COMMIT: usize = 20;
  2. Consider edge case: What happens with loops that iterate exactly LOOP_ITERS_PER_COMMIT * N times? The final commit happens in the Loop::Break case, which is correct, but worth verifying.

  3. Metrics: Consider adding a metric for loop state commits vs iterations to monitor the effectiveness of this optimization:

metrics::LOOP_STATE_COMMITS.increment(1);
metrics::LOOP_ITERATIONS_SKIPPED.record((iteration % LOOP_ITERS_PER_COMMIT) as u64);
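The "exactly LOOP_ITERS_PER_COMMIT * N iterations" edge case raised above can be checked with a small simulation: count the commits for a loop that breaks after `total` iterations, assuming a commit at every multiple of the batch size plus a final commit on break. This sketches the described behavior, not the engine code:

```rust
// Simulation of the commit points described in the review, under the
// assumption that Loop::Break always commits the final state.
const LOOP_ITERS_PER_COMMIT: usize = 20;

/// Iterations at which state would be committed for a loop that breaks
/// after `total` iterations.
fn commit_points(total: usize) -> Vec<usize> {
    let mut points: Vec<usize> = (1..=total)
        .filter(|i| i % LOOP_ITERS_PER_COMMIT == 0)
        .collect();
    // Final state is always committed on Loop::Break; skip it if the
    // last batched commit already covered the final iteration.
    if points.last() != Some(&total) {
        points.push(total);
    }
    points
}

fn main() {
    // Exact multiple: the iteration-40 batch commit coincides with the
    // break, so there is no redundant extra commit.
    assert_eq!(commit_points(40), vec![20, 40]);
    // Non-multiples: the break adds one final commit after the last batch.
    assert_eq!(commit_points(15), vec![15]);
    assert_eq!(commit_points(45), vec![20, 40, 45]);
}
```

If the real Loop::Break path commits unconditionally rather than checking for a coinciding batch commit, the exact-multiple case would issue a harmless duplicate write; the simulation makes that the only open question.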

Summary

This is a solid performance optimization with good parallelization strategy. The main concern is the potential for state loss on crashes, which may or may not be acceptable depending on your consistency requirements. I'd recommend:

  1. Must have: Document the trade-offs and idempotency requirements
  2. Should have: Add tests for edge cases and crash scenarios
  3. Nice to have: Make batch size configurable and add metrics

Overall: ✅ Approve with recommendations - the optimization is valuable, but please address the state loss concern with documentation at minimum.

@claude

claude bot commented Nov 19, 2025

Code Review

I've reviewed the loop optimization changes in this PR. Overall, this is a well-thought-out performance optimization with good attention to detail. Here are my findings:

✅ Strengths

  1. Smart Performance Optimization: The use of tokio::join! to parallelize database operations with user code execution is excellent. This allows the loop event commit, branch event commit, and user callback to run concurrently instead of serially.

  2. Clear Documentation: The comment at lines 949-955 explaining the optimization strategy is very helpful for future maintainers.

  3. Reduced DB Write Pressure: The LOOP_ITERS_PER_COMMIT constant (20 iterations) significantly reduces database writes for long-running loops while still maintaining reasonable state persistence.

  4. Maintains Correctness:

    • Final loop state is always committed on Loop::Break
    • Error handling is preserved with the `?` operator after the join

🔍 Potential Issues

1. State Loss Risk on Failure (Moderate Severity)

With the new batching strategy, if a workflow crashes between commits, up to 19 iterations of state could be lost. The workflow would resume from the last committed iteration, potentially repeating work.

Current behavior:

  • Iterations 0-19: State committed only at iteration 0
  • Crash at iteration 15: Resumes from iteration 0, repeating 15 iterations

Consider:

  • Document this trade-off in comments
  • Add a note in workflow documentation about idempotency requirements for loop bodies
  • Consider making LOOP_ITERS_PER_COMMIT configurable if different workflows have different requirements

2. Missing Test Coverage for New Behavior

The existing loop test only runs 3 iterations, which won't exercise the batching logic (kicks in at iteration 20).

Recommendation:
Add a test case with >20 iterations to verify:

  • State is correctly restored after interruption
  • Performance improvement is measurable
  • The batching behavior works as expected

Example test:

#[tokio::test]
async fn test_loop_many_iterations() {
    // Test with 50+ iterations to verify batching behavior
    let workflow_id = test_ctx
        .workflow(LoopWorkflowInput { iterations: 50 })
        .dispatch()
        .await
        .unwrap();
    
    // Verify completion and state handling
}

3. Potential Indexing Issue (Minor)

At line 994, the condition uses:

if iteration % LOOP_ITERS_PER_COMMIT == 0

This means commits happen at iterations 0, 20, 40, etc. However, iteration 0 is already committed in the initial setup (lines 891-904), so iteration 20 would be the first batched commit. This is likely intentional but worth confirming.

Verify: Is it expected that iteration 0 is committed twice (once in setup, once in the modulo check)?
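One way to answer this concretely is to enumerate every point where upsert_workflow_loop_event would fire under the described scheme: once at setup, then on the modulo check for each subsequent iteration. This hypothetical sketch assumes the in-loop check starts at iteration 1; if it instead starts at 0, iteration 0 would indeed be committed twice:

```rust
// Hypothetical enumeration of commit points, assuming the in-loop modulo
// check runs from iteration 1 onward (not the engine's actual code).
const LOOP_ITERS_PER_COMMIT: usize = 20;

/// All iterations at which a loop event commit would fire: one at setup
/// (iteration 0), then every multiple of the batch size.
fn commit_schedule(total: usize) -> Vec<usize> {
    let mut commits = vec![0]; // initial loop event creation during setup
    for iteration in 1..=total {
        if iteration % LOOP_ITERS_PER_COMMIT == 0 {
            commits.push(iteration);
        }
    }
    commits
}

fn main() {
    // Iteration 0 appears exactly once (the setup commit) and the first
    // batched commit lands at iteration 20.
    assert_eq!(commit_schedule(45), vec![0, 20, 40]);
}
```

If the observed schedule instead contains iteration 0 twice, the in-loop check is also firing at iteration 0 and the setup commit is redundant.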

4. Database Load Consideration

The upsert_workflow_loop_event function performs significant work when iteration > 0:

  • Moves all current events to forgotten history (lines 2563-2579)
  • Clears loop events subspace (line 2581)
  • Cleans up old forgotten history if iteration > 100 (lines 2584-2603)

This is now called every 20 iterations instead of every iteration, which is good. However, for very long-running loops (thousands of iterations), this could still create periodic latency spikes.

Consider: Monitor the performance impact of these operations in production.

🛡️ Security & Error Handling

Good:

  • No security concerns identified
  • Error handling is properly maintained through the ? operator
  • The optimization doesn't introduce race conditions

📊 Code Quality

Excellent:

  • Follows Rust idioms and async best practices
  • Properly uses structured logging patterns
  • Adheres to the project's error handling conventions
  • Code is well-commented for complex sections

📝 Minor Suggestions

  1. Constant Documentation: Consider adding a comment explaining why 20 was chosen for LOOP_ITERS_PER_COMMIT:
/// How often to commit loop event data to db and mark previous loop history as forgotten.
/// Set to 20 to balance between state persistence and database write pressure.
const LOOP_ITERS_PER_COMMIT: usize = 20;
  2. Metrics: Consider adding a metric to track actual vs replayed iterations after recovery to monitor the impact of the batching strategy.

Summary

This is a solid optimization that should significantly improve performance for loop-heavy workflows. The main recommendation is to add test coverage for the >20 iteration case and document the state loss trade-off for users.

Recommendation: Approve with minor suggestions


🤖 Generated with Claude Code

@NathanFlurry NathanFlurry changed the base branch from 11-19-fix_tunnel_implement_ping_pong to graphite-base/3475 November 19, 2025 23:25