
Conversation

@julianpryde (Contributor) commented on Aug 30, 2025

Description

Adds random jitter to the DNS retry delays used by the stagger_call function.

Breaking Changes

N/A

Notes & open questions

Further research can determine whether additional timeout jitter is needed when calling the other DnsResolver public functions.

A nominal 20% of the default delay time is used as the maximum deviation from the original timeout delay. Real-world and lab testing can determine whether this is sufficient.
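
For illustration, with a 20% cap the jittered delay stays within +/- 20% of the base delay. A minimal sketch of that arithmetic, assuming a rand-based helper (the function name and exact implementation here are not the PR's code):

```rust
use std::time::Duration;

use rand::Rng;

/// Illustrative sketch only; not the code added by this PR.
fn apply_jitter(delay: Duration, max_jitter_percent: u64) -> Duration {
    let max_jitter_ms = delay.as_millis() as u64 * max_jitter_percent / 100;
    // Pick a uniform offset in [0, 2 * max_jitter] and shift it down so the
    // result lands in [delay - max_jitter, delay + max_jitter].
    let offset_ms = rand::thread_rng().gen_range(0..=2 * max_jitter_ms);
    delay + Duration::from_millis(offset_ms) - Duration::from_millis(max_jitter_ms)
}
```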

Partially addresses #3017.

Change checklist

  • Self-review.
  • Tests if relevant.

@n0bot n0bot bot added this to iroh Aug 30, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Aug 30, 2025

#[tokio::test]
Contributor:

You don't need an async test at all: use #[test] instead of #[tokio::test] and call add_jitter() directly in the tests; then you don't need the jitter_test_backend indirection.
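
A minimal sketch of the synchronous version, assuming add_jitter takes and returns a Duration (the actual signature in the PR may differ):

```rust
use std::time::Duration;

#[test]
fn jitter_stays_within_bounds() {
    // Assumed signature: add_jitter(Duration) -> Duration.
    let delay = Duration::from_millis(300);
    let jittered = add_jitter(delay);
    assert!(jittered >= delay * 80 / 100);
    assert!(jittered <= delay * 120 / 100);
}
```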

Contributor Author:

Will fix in the edit

@julianpryde julianpryde changed the title add jitter to dns call timeoutes when made in the stagger_call func… fix(iroh) add jitter to dns call timeout made in the stagger_call function Sep 6, 2025
@julianpryde julianpryde changed the title fix(iroh) add jitter to dns call timeout made in the stagger_call function fix(iroh): add jitter to dns call timeout made in the stagger_call function Sep 6, 2025
@julianpryde (Contributor Author):

  • Updated PR title to match format
  • Made tests synchronous
  • Updated the jitter calculation logic
    • It seems like the truly right way to calculate jitter and account for bounds errors is to create a TimeoutDelay struct in n0_computer/time.rs that carries bounds and errors for when those bounds are exceeded, but that seems excessive at this point (a rough sketch of what that could look like follows this list).
  • Ensured clippy check will pass
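
A hypothetical sketch of such a bounds-checked type; nothing like this exists in n0_computer/time.rs today, and the MIN/MAX values here are made up for illustration:

```rust
use std::time::Duration;

/// Hypothetical sketch of a bounds-checked delay type; not part of this PR.
#[derive(Debug, Clone, Copy)]
pub struct TimeoutDelay(Duration);

#[derive(Debug)]
pub enum TimeoutDelayError {
    BelowMinimum(Duration),
    AboveMaximum(Duration),
}

impl TimeoutDelay {
    /// Illustrative bounds only.
    pub const MIN: Duration = Duration::from_millis(10);
    pub const MAX: Duration = Duration::from_secs(30);

    pub fn new(value: Duration) -> Result<Self, TimeoutDelayError> {
        if value < Self::MIN {
            Err(TimeoutDelayError::BelowMinimum(value))
        } else if value > Self::MAX {
            Err(TimeoutDelayError::AboveMaximum(value))
        } else {
            Ok(Self(value))
        }
    }

    pub fn get(&self) -> Duration {
        self.0
    }
}
```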

@@ -548,4 +566,31 @@ pub(crate) mod tests {
let result = stagger_call(f, &delays).await.unwrap();
assert_eq!(result, 5)
}

#[tokio::test]
Contributor:

This also doesn't use any async? Just #[test] should do.

#[traced_test]
fn jitter_test_nonzero_positive() {
    let delay: u64 = 300;
    for _ in 0..1000000 {
Contributor:

How long does this test take? I understand that you needed some confidence, but running it every time might be a bit much. I guess this is a use case for proptest, though I also appreciate that that is a lot for this simple change. Maybe run it with something like 10 or 100 iterations?
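
A sketch of the proptest alternative, again assuming an add_jitter(Duration) -> Duration helper, which may not match the PR's actual API:

```rust
use std::time::Duration;

use proptest::prelude::*;

proptest! {
    // Sketch only: proptest generates and shrinks the delay values instead of
    // a hand-rolled million-iteration loop.
    #[test]
    fn jitter_within_twenty_percent(delay_ms in 1u64..10_000) {
        let delay = Duration::from_millis(delay_ms);
        let jittered = add_jitter(delay);
        prop_assert!(jittered >= delay * 80 / 100);
        prop_assert!(jittered <= delay * 120 / 100);
    }
}
```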

Contributor Author:

I added some zeros because I was paranoid I had done the math wrong, and it didn't add any noticeable delay when I ran the tests. I can lower it, though, to save CI/CD compute.

// Sanity checks that I did the math right
#[test]
#[traced_test]
fn jitter_test_nonzero_positive() {
Contributor:

Probably nitpicking, but the naming of _negative and _positive confused me. Maybe _lower_bound and _upper_bound would be better?

@flub (Contributor) commented on Sep 8, 2025

The new way of calculating is a lot easier to follow! Thanks for these changes.

@flub (Contributor) commented on Sep 8, 2025

Could you please also fix the formatting issues? The cargo deny and wasm32 CI failures are not due to this PR and we'll fix them outside of it.

@julianpryde (Contributor Author):

  • Updated comment on MAX_JITTER_PERCENT definition to make sense
  • Made jitter_test_zero a synchronous test
  • Reduced number of iterations of jitter bounds tests
  • Renamed jitter bounds tests
  • Ran cargo fmt

@@ -28,6 +28,9 @@ pub const N0_DNS_NODE_ORIGIN_PROD: &str = "dns.iroh.link";
/// The n0 testing DNS node origin, for testing.
pub const N0_DNS_NODE_ORIGIN_STAGING: &str = "staging-dns.iroh.link";

/// Percent of total delay to jitter. 20 means +/- 20% of delay.
const MAX_JITTER_PERCENT: u64 = 20;
Contributor:

Last remaining comment now that I understand this value: why did you choose +/- 20%? Isn't that rather big? I think I might have chosen +/- 10%.

Contributor Author:

I was actually thinking 20% was relatively conservative. Looking at this article from AWS, they found success with jittering a random value between 0 and the normal delay.

I know the assumptions are probably different between iroh's use case and theirs, so I can run some tests in a lab environment and see what the impact is on DNS server load and request response time.
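
For comparison, the "full jitter" approach from that article randomizes the entire delay rather than nudging it; a rough sketch (not what this PR does):

```rust
use std::time::Duration;

use rand::Rng;

// "Full jitter": the retry waits anywhere in [0, delay], which spreads load
// the most aggressively but can also retry almost immediately.
fn full_jitter(delay: Duration) -> Duration {
    let ms = rand::thread_rng().gen_range(0..=delay.as_millis() as u64);
    Duration::from_millis(ms)
}
```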

Contributor:

Cool, thanks for the context. We do retries eagerly as well to deal with slow responses, so we don't quite want to make the jittering between [0..=delay]. I'm happy to give this 20% a try and see what happens.

I suspect in any case that this isn't going to solve the entire issue. As I said in #3017 we need to collect all callsites and think about it with a global view.

@flub flub changed the title fix(iroh): add jitter to dns call timeout made in the stagger_call function fix(iroh): add jitter to dns retry calls Sep 15, 2025
@flub (Contributor) left a comment:

Thanks for working on this!

@flub flub added this pull request to the merge queue Sep 15, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 15, 2025
@flub flub added this pull request to the merge queue Sep 15, 2025
Merged via the queue into n0-computer:main with commit f3da758 Sep 15, 2025
31 checks passed
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in iroh Sep 15, 2025