-
Notifications
You must be signed in to change notification settings - Fork 4k
util: prioritize cancellations in retry loop #154782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. Before a member of our team reviews your PR, I have some potential action items for you:
I was unable to automatically find a reviewer. You can try CCing one of the following members:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Thanks for opening a PR! I'm not sure if the fix is quite right, I'll defer to @kev-cao who authored the test for that. Generally, we don't expect community members to look into test failures, so I'd encourage you to take a look at #41815 as a starting point to find an interesting feature issue to work on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM! While this test does use a manually controlled clock, these two particular subtests have some variability due to the select
statement randomly picking between the stopped channels versus the clock's timer.
Increasing the timing tolerance like this will decrease the amount of flakes without impacting the correctness of the test, although to fully resolve the issue we'd probably want to look into repeatedly advancing the manual clock by some fraction of the backoff to avoid ties.
@yuzefovich Thanks for the feedback! I appreciate @kev-cao confirming the fix approach. I notice the CI failures seem to be infrastructure-related rather than issues with the code changes. Should I wait for these to resolve, or is there anything specific I should address? Also, thank you for pointing me toward #41815 for feature work - I'll definitely explore those opportunities for more substantial contributions! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well the CI issues were a little weird since those subtests were failing under duress despite the fact that they should be skipped under duress.
That being said, while I was investigating this, I realized that there is a better solution here that eliminates the need to skip running these tests under duress. In retry.Next
, we currently perform a blocking select
on the backoff timer, context cancellation, and stopper.
This does mean that with shorter backoffs, context cancellation/stoppers are not prioritized in the event that all three channels are ready. This is the reason why we are running into these flakes.
If we instead do a two-stage select
, a non-blocking select
on the context/stopper first before running our blocking select
, this correctly prioritizes context cancellation/stops and also prevents these flakes entirely and we can delete the code for skipping under duress.
@sanki92 If you'd like to give this a try, you are more than welcome to. Otherwise I can put up a quick PR for it.
@kev-cao I'd love to give this a try! Thank you for the detailed explanation of the two-stage select approach - it makes perfect sense and is a much more elegant solution than the timing tolerance bandaid. I'll implement the non-blocking select for context/closer prioritization followed by the blocking select with timer, and remove the duress skipping logic as you described. This is exactly the kind of meaningful contribution I was hoping to make. I appreciate you taking the time to mentor me through the proper solution! |
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
@kev-cao Implementation complete! Two-stage select with context/closer prioritization is now in place, and duress skipping logic has been removed. Ready for your review! |
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
pkg/util/retry/retry_test.go
Outdated
// Under duress, closing a channel will not necessarily stop the retry | ||
// loop immediately, so we skip this test under duress. | ||
skipUnderDuress: true, | ||
expectedTimeSpent: time.Millisecond, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the context/stopper is canceled before the retry loop, the expected time would be 0 instead of 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed ✅
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
pkg/util/retry/retry_test.go
Outdated
|
||
if tc.expectedTimeSpent != 0 { | ||
require.Equal( | ||
t, tc.expectedTimeSpent, timeSource.Since(start), "expected time does not match actual spent time", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll also need to get rid of the condition or else the tests are essentially a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Let me know if you spot anything else.
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Does the test still pass? |
Can't test locally - need Bazel build system for generated code. Logic should be correct though! |
Hmm, you should be able to build following the instructions here. In any case, the test fails because the two subtests report that 1 millisecond has passed. |
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
@kev-cao Fixed the expected times back to 1ms. Based on your feedback that the test was failing because it reported 1ms elapsed (when we expected 0), I analyzed the code and realized the backingOffHook advances manual time even when cancellation is detected immediately. Also my system isn't able to handle the build process properly to test locally, but the logic should be correct now. |
I think the implementation needs updating. If the context is canceled before the retry loop ever runs, then no backoff will have been performed. |
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
I've made the changes. Could you check it locally when you get a chance? My setup can't handle the build. Thanks! |
This commit teaches `util.Retry` to prioritize context cancellations and stoppers over retry attempts. This ensures more consistent behaviors and reduces test flakes. Fixes: cockroachdb#154764 Release note: None
7a49f1b
to
8c3abab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM! Updated the commit message and PR to follow our conventions. I'll get one more set of eyes on this before we merge it. Thanks for the contribution!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth backporting to 25.4?
@yuzefovich reviewed 1 of 2 files at r3, 2 of 2 files at r5, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @kev-cao and @sanki92)
bors r=kev-cao,yuzefovich |
Build succeeded: |
This commit teaches
util.Retry
to prioritize context cancellations and stoppers over retry attempts. This ensures more consistent behaviors and reduces test flakes.Fixes: #154764
Release note: None