-
Notifications
You must be signed in to change notification settings - Fork 296
fix(iroh): add jitter to dns retry calls #3447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(iroh): add jitter to dns retry calls #3447
Conversation
iroh-relay/src/dns.rs
Outdated
} | ||
} | ||
|
||
#[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need an async test at all, use #[test]
instead of #[tokio::test]
and call add_jitter()
directly in the tests and then you don't need the jitter_test_backend
indirection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in the edit
stagger_call
func…stagger_call
function
stagger_call
functionstagger_call
function
|
iroh-relay/src/dns.rs
Outdated
@@ -548,4 +566,31 @@ pub(crate) mod tests { | |||
let result = stagger_call(f, &delays).await.unwrap(); | |||
assert_eq!(result, 5) | |||
} | |||
|
|||
#[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also doesn't use any async? Just #[test]
should do.
iroh-relay/src/dns.rs
Outdated
#[traced_test] | ||
fn jitter_test_nonzero_positive() { | ||
let delay: u64 = 300; | ||
for _ in 0..1000000{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How long does this test take? I understand that you needed some confidence, but to run each time this might be a bit much. I guess this is a usecase for proptest, but also appreciate that is a lot for this simple change. Maybe run it with something like 10 or 100?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some zeros because I was paranoid I had done the math wrong and it didn't add any noticeable delay when I ran the tests. I can lower it though to save CI/CD compute
iroh-relay/src/dns.rs
Outdated
//Sanity checks that I did the math right | ||
#[test] | ||
#[traced_test] | ||
fn jitter_test_nonzero_positive() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably nitpicking. But the naming of _negative
& _positive
confused me. Maybe _lower_bound
and _upper_bound
are better?
The new way of calculating is a lot easier to follow! Thanks for these changes. |
Could you please also fix the formatting issues? The cargo deny and was32 CI failures are not due to this PR and we'll fix them outside of this. |
|
@@ -28,6 +28,9 @@ pub const N0_DNS_NODE_ORIGIN_PROD: &str = "dns.iroh.link"; | |||
/// The n0 testing DNS node origin, for testing. | |||
pub const N0_DNS_NODE_ORIGIN_STAGING: &str = "staging-dns.iroh.link"; | |||
|
|||
/// Percent of total delay to jitter. 20 means +/- 20% of delay. | |||
const MAX_JITTER_PERCENT: u64 = 20; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last remaining comment, now I understand this value: Why did you choose +/- 20%? Isn't that rather big? I think I might have chosen +/- 10%.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually thinking 20% was relatively conservative. Looking at this article from aws, they found success with jittering a random value between 0 and the normal delay.
I know the assumptions are probably different between iroh's use case and theirs, so I cam run some tests in a lab environment and see what the impact is to dns server load and request response time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks for the context. We do retries eagerly as well to deal with slow responses, so we don't quite want to make the jittering between [0..=delay]. I'm happy to give this 20% a try and see what happens.
I suspect in any case that this isn't going to solve the entire issue. As I said in #3017 we need to collect all callsites and think about it with a global view.
stagger_call
functionThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
Description
Adds a random jitter to DNS retries called with the
stagger_call
function.Breaking Changes
N/A
Notes & open questions
Further research can determine if additional timeout jitters are needed when calling the other DnsResolver public functions.
Nominal 20% of default delay time as maximum deviation from original timeout delay. Real world and lab testing can dermine if this is sufficient.
Partially addresses #3017.
Change checklist