
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 14, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

4 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 14, 2025 2:40am
rivet-inspector Ignored Ignored Preview Nov 14, 2025 2:40am
rivet-site Ignored Ignored Preview Nov 14, 2025 2:40am
rivetkit-serverless Skipped Skipped Nov 14, 2025 2:40am

@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 14, 2025 01:40 Inactive
Contributor Author

MasterPtato commented Nov 14, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 14, 2025

PR Review: fix(udb): update backoff algo

Summary

This PR updates the transaction retry backoff algorithm to more closely align with FoundationDB's behavior. Overall, the changes look good and represent a significant improvement.

Code Quality: ✅ Good

Strengths:

  • Excellent documentation with concrete examples showing the backoff progression
  • References to FoundationDB source code provide good context
  • Clean, readable implementation

Suggestions:

  1. Documentation clarity: The comment at line 45 says "Jitter is 0-10% of backoff ms" but this is slightly misleading:

    • At attempt 0: jitter is 0% (base=1, so jitter is always 0)
    • At attempt 1+: jitter is 0-10% as stated

    Consider updating the comment to: // Jitter is 0-10% of backoff ms (proportional to base)

Correctness: ✅ Verified

Algorithm Analysis:

Old algorithm:

  • Max attempt capped at 10
  • Max backoff: 2^10 * 10 = 10,240ms
  • Fixed jitter: 0-99ms (not proportional)

New algorithm:

  • Max attempt capped at 7 ✅
  • Max backoff: 2^7 * 10 = 1,280ms ✅ (aligns with FDB's 1s max mentioned in docs)
  • Proportional jitter: 0-10% ✅

The math checks out correctly:

  • base = 2^min(attempt, 7) produces: 1, 2, 4, 8, 16, 32, 64, 128
  • base_backoff_ms = base * 10 produces: 10, 20, 40, 80, 160, 320, 640, 1280ms
  • jitter_ms = rand() % base produces proportional jitter
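Putting the three bullets together, the new algorithm corresponds to a function along these lines (an illustrative sketch, not the actual source in utils/mod.rs; the raw random value is a parameter here so the arithmetic can be checked deterministically, whereas the real code draws it from rand::random::<u64>()):

```rust
// Sketch of the new backoff arithmetic described above.
// `raw_jitter` stands in for rand::random::<u64>().
fn tx_retry_backoff_ms(attempt: usize, raw_jitter: u64) -> u64 {
    let base = 2_u64.pow((attempt as u32).min(7)); // 1, 2, 4, ..., 128 (capped)
    let base_backoff_ms = base * 10;               // 10, 20, 40, ..., 1280ms
    let jitter_ms = raw_jitter % base;             // proportional jitter, 0..base
    base_backoff_ms + jitter_ms
}

fn main() {
    // Attempt 3 with raw jitter 5: 80ms base + (5 % 8) = 85ms
    println!("{}", tx_retry_backoff_ms(3, 5)); // → 85
}
```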

Potential Issues: ⚠️ Minor

Edge case - Attempt 0:
At attempt 0, base = 1, so jitter_ms = rand::random::<u64>() % 1 will always be 0. This is fine functionally, but means the first retry has no jitter. This could lead to thundering herd issues if many transactions fail simultaneously on their first attempt.

Recommendation: Consider starting jitter from attempt 0. For example:

// Option 1: Always add at least 1ms to the modulo
let jitter_ms = rand::random::<u64>() % (base + 1);

// Option 2: Add a minimum jitter amount
let jitter_ms = rand::random::<u64>() % base.max(10);

However, this may be intentional to match FDB's behavior (which appears to have no jitter based on the referenced code), so this is more of a consideration than a required fix.

Performance: ✅ Excellent

The new algorithm is actually better for performance:

  • Lower max backoff (1.28s vs 10.24s) means transactions retry more quickly under sustained conflict
  • Proportional jitter is more appropriate than fixed jitter for preventing thundering herds
  • Better alignment with FoundationDB's proven retry strategy

Security: ✅ No concerns

The use of rand::random() for jitter is appropriate for this use case (not cryptographic).

Test Coverage: ✅ Adequate

The function is used in:

  1. postgres/database.rs - production usage
  2. rocksdb/database.rs - production usage
  3. tests/rocksdb.rs - test usage (though this test has a duplicate implementation of the old algorithm at lines 140-146)

Note: The test file at tests/rocksdb.rs:140-146 has a duplicate implementation of the old algorithm. This should probably be removed since the test should use the shared implementation from utils::calculate_tx_retry_backoff.

Overall Assessment: ✅ Approve with minor suggestions

This is a solid improvement that brings the retry logic closer to FoundationDB's proven approach. The main suggestions are:

  1. Consider adding jitter for attempt 0 to prevent thundering herd (optional, depends on FDB alignment goals)
  2. Clean up the duplicate implementation in tests/rocksdb.rs
  3. Minor documentation clarification about jitter calculation

Great work on the detailed documentation and FDB source references!

@MasterPtato MasterPtato force-pushed the 11-13-fix_udb_update_backoff_algo branch from 9b19169 to a0dfaaf Compare November 14, 2025 02:40
@MasterPtato MasterPtato force-pushed the 11-12-chore_write_update_docs branch from 52bea63 to ce30dce Compare November 14, 2025 02:40
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 14, 2025 02:40 Inactive
@claude

claude bot commented Nov 14, 2025

Code Review: fix(udb): update backoff algo

Summary

This PR updates the transaction retry backoff algorithm in UniversalDB to better align with FoundationDB's behavior. Overall, this is a good improvement that makes the backoff more aggressive and consistent with FDB's design.

✅ Positive Changes

  1. Better alignment with FDB behavior: The new algorithm caps at attempt 7 (max ~1.28s) instead of attempt 10 (max ~10.24s), which is closer to FDB's 1-second max backoff.

  2. Proportional jitter: The jitter is now proportional to the backoff (0-10% of base backoff) rather than a fixed 0-100ms. This is more sensible for larger backoffs.

  3. Excellent documentation: The inline documentation comparing the implementation to FDB with specific links is very helpful.

🐛 Potential Issues

Potential modulo bias in jitter calculation

Location: engine/packages/universaldb/src/utils/mod.rs:46

let jitter_ms = rand::random::<u64>() % base;

Issue: Using modulo with rand::random::<u64>() introduces modulo bias whenever the divisor does not evenly divide u64::MAX + 1. In this code, base is always a power of two (1 to 128), which divides 2^64 exactly, so there is no actual bias today. The pattern is fragile, though: any future change to a non-power-of-two divisor would silently introduce bias.

Recommendation: Use a proper bounded random range:

use rand::Rng;
let jitter_ms = rand::thread_rng().gen_range(0..base);

This only requires the rand crate's default (std) features; since the code already calls rand::random, the dependency is already present.

Minor: Inconsistency in test file

Location: engine/packages/universaldb/tests/rocksdb.rs:140

The test file has its own copy of calculate_tx_retry_backoff that still uses the old algorithm (max attempt 10). This should be updated to use the exported function from the main library or at least be kept in sync:

// Instead of duplicating, import from the library
use universaldb::calculate_tx_retry_backoff;

📊 Behavioral Analysis

Old algorithm:

  • Attempt 0: 10ms + 0-99ms jitter = 10-109ms
  • Attempt 5: 320ms + 0-99ms jitter = 320-419ms
  • Attempt 10 (max): 10,240ms + 0-99ms jitter = ~10.24-10.34s

New algorithm:

  • Attempt 0: 10ms + 0ms jitter (rand % 1 is always 0) = exactly 10ms
  • Attempt 5: 320ms + 0-31ms jitter = 320-351ms
  • Attempt 7 (max): 1,280ms + 0-127ms jitter = ~1.28-1.41s

The new algorithm is:

  • More aggressive (lower max backoff)
  • More consistent (proportional jitter)
  • Better aligned with FDB's 1-second max
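The schedule comparison above can be reproduced with a short sketch (jitter omitted, so only the deterministic base backoff is shown; the cap parameter is introduced here purely for the side-by-side comparison):

```rust
// Base backoff (without jitter) for the old (cap 10) and new (cap 7) algorithms.
fn base_backoff_ms(attempt: u32, cap: u32) -> u64 {
    2_u64.pow(attempt.min(cap)) * 10
}

fn main() {
    for attempt in [0u32, 5, 10] {
        println!(
            "attempt {:>2}: old = {:>5}ms, new = {:>4}ms",
            attempt,
            base_backoff_ms(attempt, 10), // old cap: attempt 10 → 10,240ms
            base_backoff_ms(attempt, 7),  // new cap: attempt 7 → 1,280ms
        );
    }
}
```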

🔒 Security Considerations

No security concerns identified.

🧪 Test Coverage

No new tests were added. Consider adding a unit test to verify:

  1. The backoff values for known attempt numbers
  2. That jitter is within expected bounds
  3. That the max cap (attempt 7) works correctly

Example test:

#[test]
fn test_backoff_algorithm() {
    // Attempt 0: base = 1, so jitter is always 0
    assert_eq!(calculate_tx_retry_backoff(0), 10);

    // Attempt 1: base = 2, so jitter is 0 or 1ms
    let backoff_1 = calculate_tx_retry_backoff(1);
    assert!(backoff_1 >= 20 && backoff_1 < 22);

    // Test max cap: attempts 7 and 100 share the same base (capped at 7)
    let backoff_7 = calculate_tx_retry_backoff(7);
    let backoff_100 = calculate_tx_retry_backoff(100);
    assert!(backoff_7 >= 1280 && backoff_7 < 1408);
    assert!(backoff_100 >= 1280 && backoff_100 < 1408);
}

📝 Recommendations

Required:

  1. Prefer rand::thread_rng().gen_range(0..base) over the modulo pattern in the jitter calculation (no actual bias today since base is a power of two, but the range form is more robust to change)
  2. Update or remove the duplicate function in tests/rocksdb.rs:140

Optional:

  3. Add unit tests for the backoff algorithm
  4. Consider verifying the FDB source links are correct (I couldn't fetch them to verify)

Overall Assessment

Approved with suggested fixes

The core change is sound and improves the backoff algorithm. The modulo bias issue should be addressed before merging, and the test file inconsistency should be resolved.

@claude

claude bot commented Nov 14, 2025

Code Review

Summary

This PR updates the exponential backoff algorithm for transaction retries in UniversalDB to better align with FoundationDB's implementation. The changes improve the retry behavior by capping the max backoff and adjusting the jitter calculation.

Positive Changes ✅

  1. Better max backoff cap: Changed from min(10) (max 10.24s) to min(7) (max 1.28s), which is closer to FDB's 1s max
  2. Proportional jitter: The jitter is now 0-10% of the backoff (was fixed 0-100ms), making it scale appropriately with the backoff duration
  3. Excellent documentation: The comments clearly explain the algorithm with examples and reference FDB source code

Issues Found 🔍

1. CRITICAL: Outdated duplicate function in tests

Location: engine/packages/universaldb/tests/rocksdb.rs:140-146

There's a duplicate calculate_tx_retry_backoff function in the test file that still uses the OLD algorithm:

pub fn calculate_tx_retry_backoff(attempt: usize) -> u64 {
    let base_backoff_ms = 2_u64.pow((attempt as u32).min(10)) * 10;
    let jitter_ms = rand::random::<u64>() % 100;
    base_backoff_ms + jitter_ms
}

Action required: This duplicate should be removed and the test should import from universaldb::calculate_tx_retry_backoff instead. Having two different implementations will cause confusion and inconsistent behavior.

2. Potential arithmetic issue in jitter calculation

Location: engine/packages/universaldb/src/utils/mod.rs:46

The jitter calculation rand::random::<u64>() % base has an issue:

  • For attempt 0: base = 1, so jitter is random % 1 = 0 (always 0, not 0-1ms as documented)
  • For attempt 1: base = 2, so jitter is random % 2 (0-1ms, not 0-2ms as documented)
  • For attempt 2: base = 4, so jitter is random % 4 (0-3ms, not 0-4ms as documented)

The comment says "0-10% of backoff ms" but the actual range is 0 to (base-1)ms, which is slightly less than 10% of base_backoff_ms.

Suggested fix:

// Jitter is 0-10% of backoff ms
let jitter_ms = rand::random::<u64>() % (base_backoff_ms / 10 + 1);

This would give:

  • Attempt 0: 0-1ms jitter (10ms backoff)
  • Attempt 1: 0-2ms jitter (20ms backoff)
  • Attempt 2: 0-4ms jitter (40ms backoff)
  • Attempt 7: 0-128ms jitter (1280ms backoff)

Which matches the documentation exactly.
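A quick deterministic check that the suggested fix produces those ranges (a sketch; the raw random value is a parameter here, whereas the real code would draw it from rand):

```rust
// Jitter per the suggested fix: 0..=10% of the base backoff, inclusive.
fn fixed_jitter_ms(base_backoff_ms: u64, raw: u64) -> u64 {
    raw % (base_backoff_ms / 10 + 1)
}

fn main() {
    // Attempt 0 (10ms backoff): jitter can now be 0 or 1ms
    println!("{}", fixed_jitter_ms(10, 1)); // → 1
}
```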

3. Missing unit tests

There don't appear to be any unit tests specifically for calculate_tx_retry_backoff. Given that this is a critical function for retry behavior, it should have tests verifying:

  • The exponential growth curve
  • The max cap at attempt 7+
  • The jitter is within expected bounds
  • The values roughly match FDB's behavior

Suggested test:

#[test]
fn test_tx_retry_backoff() {
    // Test base values without jitter (by checking range)
    for attempt in 0..10 {
        let backoff = calculate_tx_retry_backoff(attempt);
        let expected_base = 2_u64.pow((attempt as u32).min(7)) * 10;
        let max_jitter = 2_u64.pow((attempt as u32).min(7));
        
        assert!(backoff >= expected_base);
        assert!(backoff < expected_base + max_jitter);
    }
    
    // Verify max cap
    assert!(calculate_tx_retry_backoff(7) < 1500); // Should be ~1280ms + jitter
    assert!(calculate_tx_retry_backoff(100) < 1500); // Should still be capped
}

Minor Suggestions 💡

  1. Consider using rand::thread_rng().gen_range() instead of rand::random::<u64>() % n to avoid modulo bias (with the current power-of-two base there is no actual bias, but gen_range stays correct if the divisor ever changes)

  2. Document the dependency on FDB knobs: The comment references specific FDB source files. Consider noting that this should be kept in sync with FDB updates.

Security Considerations 🔒

No security concerns identified. The backoff algorithm is appropriate for preventing retry storms.

Performance Considerations ⚡

The changes improve performance by reducing max backoff from ~10s to ~1.28s, allowing faster recovery from transient conflicts.

Verdict

Good improvement to align with FDB behavior, but needs the test duplicate removed and the jitter calculation fixed to match the documentation. Consider adding unit tests for this critical function.


🤖 Generated with Claude Code
