
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 14, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

4 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 14, 2025 2:40am
rivet-inspector Ignored Ignored Preview Nov 14, 2025 2:40am
rivet-site Ignored Ignored Preview Nov 14, 2025 2:40am
rivetkit-serverless Skipped Skipped Nov 14, 2025 2:40am

@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 14, 2025 01:40 Inactive
Contributor Author

MasterPtato commented Nov 14, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 14, 2025

PR Review: fix(udb): update backoff algo

Summary

This PR updates the transaction retry backoff algorithm to more closely align with FoundationDB's behavior. Overall, the changes look good and represent a significant improvement.

Code Quality: ✅ Good

Strengths:

  • Excellent documentation with concrete examples showing the backoff progression
  • References to FoundationDB source code provide good context
  • Clean, readable implementation

Suggestions:

  1. Documentation clarity: The comment at line 45 says "Jitter is 0-10% of backoff ms" but this is slightly misleading:

    • At attempt 0: jitter is 0% (base=1, so jitter is always 0)
    • At attempt 1+: jitter is 0-10% as stated

    Consider updating the comment to: // Jitter is 0-10% of backoff ms (proportional to base)

Correctness: ✅ Verified

Algorithm Analysis:

Old algorithm:

  • Max attempt capped at 10
  • Max backoff: 2^10 * 10 = 10,240ms
  • Fixed jitter: 0-99ms (not proportional)

New algorithm:

  • Max attempt capped at 7 ✅
  • Max backoff: 2^7 * 10 = 1,280ms ✅ (aligns with FDB's 1s max mentioned in docs)
  • Proportional jitter: 0-10% ✅

The math checks out correctly:

  • base = 2^min(attempt, 7) produces: 1, 2, 4, 8, 16, 32, 64, 128
  • base_backoff_ms = base * 10 produces: 10, 20, 40, 80, 160, 320, 640, 1280ms
  • jitter_ms = rand() % base produces proportional jitter
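Putting the three bullets together, the new algorithm corresponds to a function along these lines (an illustrative sketch, not the actual source in utils/mod.rs; the raw random value is a parameter here so the arithmetic can be checked deterministically, whereas the real code draws it from rand::random::<u64>()):

```rust
// Sketch of the new backoff arithmetic described above.
// `raw_jitter` stands in for rand::random::<u64>().
fn tx_retry_backoff_ms(attempt: usize, raw_jitter: u64) -> u64 {
    let base = 2_u64.pow((attempt as u32).min(7)); // 1, 2, 4, ..., 128 (capped)
    let base_backoff_ms = base * 10;               // 10, 20, 40, ..., 1280ms
    let jitter_ms = raw_jitter % base;             // proportional jitter, 0..base
    base_backoff_ms + jitter_ms
}

fn main() {
    // Attempt 3 with raw jitter 5: 80ms base + (5 % 8) = 85ms
    println!("{}", tx_retry_backoff_ms(3, 5)); // → 85
}
```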

Potential Issues: ⚠️ Minor

Edge case - Attempt 0:
At attempt 0, base = 1, so jitter_ms = rand::random::<u64>() % 1 will always be 0. This is fine functionally, but means the first retry has no jitter. This could lead to thundering herd issues if many transactions fail simultaneously on their first attempt.

Recommendation: Consider starting jitter from attempt 0. For example:

// Option 1: Always add at least 1ms to the modulo
let jitter_ms = rand::random::<u64>() % (base + 1);

// Option 2: Add a minimum jitter amount
let jitter_ms = rand::random::<u64>() % base.max(10);

However, this may be intentional to match FDB's behavior (which appears to have no jitter based on the referenced code), so this is more of a consideration than a required fix.

Performance: ✅ Excellent

The new algorithm is actually better for performance:

  • Lower max backoff (1.28s vs 10.24s) means transactions retry more quickly under sustained conflict
  • Proportional jitter is more appropriate than fixed jitter for preventing thundering herds
  • Better alignment with FoundationDB's proven retry strategy

Security: ✅ No concerns

The use of rand::random() for jitter is appropriate for this use case (not cryptographic).

Test Coverage: ✅ Adequate

The function is used in:

  1. postgres/database.rs - production usage
  2. rocksdb/database.rs - production usage
  3. tests/rocksdb.rs - test usage (though this test has a duplicate implementation of the old algorithm at lines 140-146)

Note: The test file at tests/rocksdb.rs:140-146 has a duplicate implementation of the old algorithm. This should probably be removed since the test should use the shared implementation from utils::calculate_tx_retry_backoff.

Overall Assessment: ✅ Approve with minor suggestions

This is a solid improvement that brings the retry logic closer to FoundationDB's proven approach. The main suggestions are:

  1. Consider adding jitter for attempt 0 to prevent thundering herd (optional, depends on FDB alignment goals)
  2. Clean up the duplicate implementation in tests/rocksdb.rs
  3. Minor documentation clarification about jitter calculation

Great work on the detailed documentation and FDB source references!

@MasterPtato MasterPtato force-pushed the 11-13-fix_udb_update_backoff_algo branch from 9b19169 to a0dfaaf Compare November 14, 2025 02:40
@MasterPtato MasterPtato force-pushed the 11-12-chore_write_update_docs branch from 52bea63 to ce30dce Compare November 14, 2025 02:40
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 14, 2025 02:40 Inactive
@claude

claude bot commented Nov 14, 2025

Code Review: fix(udb): update backoff algo

Summary

This PR updates the transaction retry backoff algorithm in UniversalDB to better align with FoundationDB's behavior. Overall, this is a good improvement that makes the backoff more aggressive and consistent with FDB's design.

✅ Positive Changes

  1. Better alignment with FDB behavior: The new algorithm caps at attempt 7 (max ~1.28s) instead of attempt 10 (max ~10.24s), which is closer to FDB's 1-second max backoff.

  2. Proportional jitter: The jitter is now proportional to the backoff (0-10% of base backoff) rather than a fixed 0-100ms. This is more sensible for larger backoffs.

  3. Excellent documentation: The inline documentation comparing the implementation to FDB with specific links is very helpful.

🐛 Potential Issues

Potential modulo bias in jitter calculation

Location: engine/packages/universaldb/src/utils/mod.rs:46

let jitter_ms = rand::random::<u64>() % base;

Issue: Using modulo with rand::random::<u64>() introduces modulo bias whenever the divisor does not evenly divide u64::MAX + 1. In this code, base is always a power of two (1 to 128), which divides 2^64 exactly, so there is no actual bias today. The pattern is fragile, though: any future change to a non-power-of-two divisor would silently introduce bias.

Recommendation: Use a proper bounded random range:

use rand::Rng;
let jitter_ms = rand::thread_rng().gen_range(0..base);

This only requires the rand crate's default (std) features; since the code already calls rand::random, the dependency is already present.

Minor: Inconsistency in test file

Location: engine/packages/universaldb/tests/rocksdb.rs:140

The test file has its own copy of calculate_tx_retry_backoff that still uses the old algorithm (max attempt 10). This should be updated to use the exported function from the main library or at least be kept in sync:

// Instead of duplicating, import from the library
use universaldb::calculate_tx_retry_backoff;

📊 Behavioral Analysis

Old algorithm:

  • Attempt 0: 10ms + 0-99ms jitter = 10-109ms
  • Attempt 5: 320ms + 0-99ms jitter = 320-419ms
  • Attempt 10 (max): 10,240ms + 0-99ms jitter = ~10.24-10.34s

New algorithm:

  • Attempt 0: 10ms + 0ms jitter (rand % 1 is always 0) = exactly 10ms
  • Attempt 5: 320ms + 0-31ms jitter = 320-351ms
  • Attempt 7 (max): 1,280ms + 0-127ms jitter = ~1.28-1.41s

The new algorithm is:

  • More aggressive (lower max backoff)
  • More consistent (proportional jitter)
  • Better aligned with FDB's 1-second max
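The schedule comparison above can be reproduced with a short sketch (jitter omitted, so only the deterministic base backoff is shown; the cap parameter is introduced here purely for the side-by-side comparison):

```rust
// Base backoff (without jitter) for the old (cap 10) and new (cap 7) algorithms.
fn base_backoff_ms(attempt: u32, cap: u32) -> u64 {
    2_u64.pow(attempt.min(cap)) * 10
}

fn main() {
    for attempt in [0u32, 5, 10] {
        println!(
            "attempt {:>2}: old = {:>5}ms, new = {:>4}ms",
            attempt,
            base_backoff_ms(attempt, 10), // old cap: attempt 10 → 10,240ms
            base_backoff_ms(attempt, 7),  // new cap: attempt 7 → 1,280ms
        );
    }
}
```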

🔒 Security Considerations

No security concerns identified.

🧪 Test Coverage

No new tests were added. Consider adding a unit test to verify:

  1. The backoff values for known attempt numbers
  2. That jitter is within expected bounds
  3. That the max cap (attempt 7) works correctly

Example test:

#[test]
fn test_backoff_algorithm() {
    // Attempt 0: base = 1, so jitter is always 0
    assert_eq!(calculate_tx_retry_backoff(0), 10);

    // Attempt 1: base = 2, so jitter is 0 or 1ms
    let backoff_1 = calculate_tx_retry_backoff(1);
    assert!(backoff_1 >= 20 && backoff_1 < 22);

    // Test max cap: attempts 7 and 100 share the same base (capped at 7)
    let backoff_7 = calculate_tx_retry_backoff(7);
    let backoff_100 = calculate_tx_retry_backoff(100);
    assert!(backoff_7 >= 1280 && backoff_7 < 1408);
    assert!(backoff_100 >= 1280 && backoff_100 < 1408);
}

📝 Recommendations

Required:

  1. Prefer rand::thread_rng().gen_range(0..base) over the modulo pattern in the jitter calculation (no actual bias today since base is a power of two, but the range form is more robust to change)
  2. Update or remove the duplicate function in tests/rocksdb.rs:140

Optional:

  3. Add unit tests for the backoff algorithm
  4. Consider verifying the FDB source links are correct (I couldn't fetch them to verify)

Overall Assessment

Approved with suggested fixes

The core change is sound and improves the backoff algorithm. The modulo bias issue should be addressed before merging, and the test file inconsistency should be resolved.

@claude

claude bot commented Nov 14, 2025

Code Review

Summary

This PR updates the exponential backoff algorithm for transaction retries in UniversalDB to better align with FoundationDB's implementation. The changes improve the retry behavior by capping the max backoff and adjusting the jitter calculation.

Positive Changes ✅

  1. Better max backoff cap: Changed from min(10) (max 10.24s) to min(7) (max 1.28s), which is closer to FDB's 1s max
  2. Proportional jitter: The jitter is now 0-10% of the backoff (was fixed 0-100ms), making it scale appropriately with the backoff duration
  3. Excellent documentation: The comments clearly explain the algorithm with examples and reference FDB source code

Issues Found 🔍

1. CRITICAL: Outdated duplicate function in tests

Location: engine/packages/universaldb/tests/rocksdb.rs:140-146

There's a duplicate calculate_tx_retry_backoff function in the test file that still uses the OLD algorithm:

pub fn calculate_tx_retry_backoff(attempt: usize) -> u64 {
    let base_backoff_ms = 2_u64.pow((attempt as u32).min(10)) * 10;
    let jitter_ms = rand::random::<u64>() % 100;
    base_backoff_ms + jitter_ms
}

Action required: This duplicate should be removed and the test should import from universaldb::calculate_tx_retry_backoff instead. Having two different implementations will cause confusion and inconsistent behavior.

2. Potential arithmetic issue in jitter calculation

Location: engine/packages/universaldb/src/utils/mod.rs:46

The jitter calculation rand::random::<u64>() % base has an issue:

  • For attempt 0: base = 1, so jitter is random % 1 = 0 (always 0, not 0-1ms as documented)
  • For attempt 1: base = 2, so jitter is random % 2 (0-1ms, not 0-2ms as documented)
  • For attempt 2: base = 4, so jitter is random % 4 (0-3ms, not 0-4ms as documented)

The comment says "0-10% of backoff ms" but the actual range is 0 to (base-1)ms, which is slightly less than 10% of base_backoff_ms.

Suggested fix:

// Jitter is 0-10% of backoff ms
let jitter_ms = rand::random::<u64>() % (base_backoff_ms / 10 + 1);

This would give:

  • Attempt 0: 0-1ms jitter (10ms backoff)
  • Attempt 1: 0-2ms jitter (20ms backoff)
  • Attempt 2: 0-4ms jitter (40ms backoff)
  • Attempt 7: 0-128ms jitter (1280ms backoff)

Which matches the documentation exactly.
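A quick deterministic check that the suggested fix produces those ranges (a sketch; the raw random value is a parameter here, whereas the real code would draw it from rand):

```rust
// Jitter per the suggested fix: 0..=10% of the base backoff, inclusive.
fn fixed_jitter_ms(base_backoff_ms: u64, raw: u64) -> u64 {
    raw % (base_backoff_ms / 10 + 1)
}

fn main() {
    // Attempt 0 (10ms backoff): jitter can now be 0 or 1ms
    println!("{}", fixed_jitter_ms(10, 1)); // → 1
}
```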

3. Missing unit tests

There don't appear to be any unit tests specifically for calculate_tx_retry_backoff. Given that this is a critical function for retry behavior, it should have tests verifying:

  • The exponential growth curve
  • The max cap at attempt 7+
  • The jitter is within expected bounds
  • The values roughly match FDB's behavior

Suggested test:

#[test]
fn test_tx_retry_backoff() {
    // Test base values without jitter (by checking range)
    for attempt in 0..10 {
        let backoff = calculate_tx_retry_backoff(attempt);
        let expected_base = 2_u64.pow((attempt as u32).min(7)) * 10;
        let max_jitter = 2_u64.pow((attempt as u32).min(7));
        
        assert!(backoff >= expected_base);
        assert!(backoff < expected_base + max_jitter);
    }
    
    // Verify max cap
    assert!(calculate_tx_retry_backoff(7) < 1500); // Should be ~1280ms + jitter
    assert!(calculate_tx_retry_backoff(100) < 1500); // Should still be capped
}

Minor Suggestions 💡

  1. Consider using rand::thread_rng().gen_range() instead of rand::random::<u64>() % n to avoid modulo bias (with the current power-of-two base there is no actual bias, but gen_range stays correct if the divisor ever changes)

  2. Document the dependency on FDB knobs: The comment references specific FDB source files. Consider noting that this should be kept in sync with FDB updates.

Security Considerations 🔒

No security concerns identified. The backoff algorithm is appropriate for preventing retry storms.

Performance Considerations ⚡

The changes improve performance by reducing max backoff from ~10s to ~1.28s, allowing faster recovery from transient conflicts.

Verdict

Good improvement to align with FDB behavior, but needs the test duplicate removed and the jitter calculation fixed to match the documentation. Consider adding unit tests for this critical function.


🤖 Generated with Claude Code
