util: prioritize cancellations in retry loop #154782

sanki92 · 2025-10-03T19:31:55Z

This commit teaches util.Retry to prioritize context cancellations and stoppers over retry attempts. This ensures more consistent behaviors and reduces test flakes.

Fixes: #154764

Release note: None

blathers-crl · 2025-10-03T19:32:02Z

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

Before a member of our team reviews your PR, I have some potential action items for you:

Please ensure your git commit message contains a release note.
When CI has completed, please ensure no errors have appeared.

I was unable to automatically find a reviewer. You can try CCing one of the following members:

A person you worked with closely on this PR.
The person who created the ticket, or a CRDB organization member involved with the ticket (author, commenter, etc.).
Join our community slack channel and ask on #contributors.
Try to find someone else from here.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

cockroachlabs-cla-agent · 2025-10-03T19:32:07Z

All committers have signed the CLA.

yuzefovich · 2025-10-03T19:36:59Z

Thanks for opening a PR! I'm not sure if the fix is quite right, so I'll defer to @kev-cao, who authored the test.

Generally, we don't expect community members to look into test failures, so I'd encourage you to take a look at #41815 as a starting point to find an interesting feature issue to work on.

kev-cao

This LGTM! While this test does use a manually controlled clock, these two particular subtests have some variability due to the select statement randomly picking between the stopped channels versus the clock's timer.

Increasing the timing tolerance like this will decrease the number of flakes without impacting the correctness of the test. To fully resolve the issue, though, we'd probably want to look into repeatedly advancing the manual clock by some fraction of the backoff to avoid ties.
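The variability described above comes from how Go's `select` behaves when more than one case is ready: it picks one of the ready cases pseudo-randomly. The following is a minimal sketch of that behavior, not code from the PR; `countSelectWins` and both channels are hypothetical names standing in for the stopper channel and the clock's timer.

```go
package main

import "fmt"

// countSelectWins runs a select over two channels that are both already
// ready and counts how often each case is chosen.
func countSelectWins(iterations int) (stopped, timer int) {
	stopCh := make(chan struct{})
	close(stopCh) // a closed channel is always ready to receive

	for i := 0; i < iterations; i++ {
		timerCh := make(chan struct{}, 1)
		timerCh <- struct{}{} // the "timer" has already fired

		// Both cases can proceed, so the runtime picks one at random.
		select {
		case <-stopCh:
			stopped++
		case <-timerCh:
			timer++
		}
	}
	return stopped, timer
}

func main() {
	stopped, timer := countSelectWins(1000)
	fmt.Printf("stopper won %d times, timer won %d times\n", stopped, timer)
}
```

Over many iterations both counters end up non-zero, which is why a test that assumes the stopper always wins can flake.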

sanki92 · 2025-10-03T20:50:11Z

@yuzefovich Thanks for the feedback! I appreciate @kev-cao confirming the fix approach.

I notice the CI failures seem to be infrastructure-related rather than issues with the code changes. Should I wait for these to resolve, or is there anything specific I should address?

Also, thank you for pointing me toward #41815 for feature work - I'll definitely explore those opportunities for more substantial contributions!

kev-cao

Well, the CI issues were a little weird: those subtests were failing under duress even though they should be skipped under duress.

That being said, while I was investigating this, I realized that there is a better solution here that eliminates the need to skip running these tests under duress. In retry.Next, we currently perform a blocking select on the backoff timer, context cancellation, and stopper.

This does mean that with shorter backoffs, context cancellation/stoppers are not prioritized in the event that all three channels are ready. This is the reason why we are running into these flakes.

If we instead do a two-stage select (a non-blocking select on the context/stopper first, followed by our blocking select), context cancellations and stops are correctly prioritized. That also prevents these flakes entirely, so we can delete the code for skipping under duress.

@sanki92 If you'd like to give this a try, you are more than welcome to. Otherwise I can put up a quick PR for it.

sanki92 · 2025-10-03T21:49:31Z

@kev-cao I'd love to give this a try! Thank you for the detailed explanation of the two-stage select approach - it makes perfect sense and is a much more elegant solution than the timing tolerance bandaid.

I'll implement the non-blocking select for context/closer prioritization followed by the blocking select with timer, and remove the duress skipping logic as you described.

This is exactly the kind of meaningful contribution I was hoping to make. I appreciate you taking the time to mentor me through the proper solution!

blathers-crl · 2025-10-03T22:12:54Z

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

We notice you have more than one commit in your PR. We try to break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
Please ensure your git commit message contains a release note.
When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

sanki92 · 2025-10-03T22:15:54Z

@kev-cao Implementation complete! Two-stage select with context/closer prioritization is now in place, and duress skipping logic has been removed. Ready for your review!

kev-cao · 2025-10-03T23:03:30Z

pkg/util/retry/retry_test.go

-			// Under duress, closing a channel will not necessarily stop the retry
-			// loop immediately, so we skip this test under duress.
-			skipUnderDuress: true,
+			expectedTimeSpent: time.Millisecond,


Since the context/stopper is canceled before the retry loop starts, the expected time would be 0 instead of 1 millisecond.

kev-cao · 2025-10-03T23:15:02Z

pkg/util/retry/retry_test.go


 			if tc.expectedTimeSpent != 0 {
 				require.Equal(
 					t, tc.expectedTimeSpent, timeSource.Since(start), "expected time does not match actual spent time",


We'll also need to get rid of this condition, or else the tests are essentially a no-op: with an expected time of 0, the assertion is never run.

sanki92

Done! Let me know if you spot anything else.

kev-cao · 2025-10-03T23:23:47Z

Does the test still pass?

sanki92 · 2025-10-03T23:27:16Z

Can't test locally - need Bazel build system for generated code. Logic should be correct though!

kev-cao · 2025-10-04T00:20:11Z

Hmm, you should be able to build following the instructions here. In any case, the test fails because the two subtests report that 1 millisecond has passed.

sanki92 · 2025-10-04T05:35:51Z

@kev-cao Fixed the expected times back to 1ms. Based on your feedback that the test was failing because it reported 1ms elapsed (when we expected 0), I analyzed the code and realized the backingOffHook advances manual time even when cancellation is detected immediately.

Also my system isn't able to handle the build process properly to test locally, but the logic should be correct now.

kev-cao · 2025-10-04T11:42:30Z

I think the implementation needs updating. If the context is canceled before the retry loop ever runs, then no backoff will have been performed.

sanki92 · 2025-10-04T12:24:02Z

I've made the changes. Could you check it locally when you get a chance? My setup can't handle the build. Thanks!

kev-cao

This LGTM! Updated the commit message and PR to follow our conventions. I'll get one more set of eyes on this before we merge it. Thanks for the contribution!

yuzefovich

Probably worth backporting to 25.4?

@yuzefovich reviewed 1 of 2 files at r3, 2 of 2 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kev-cao and @sanki92)

kev-cao · 2025-10-06T17:16:52Z

bors r=kev-cao,yuzefovich

craig · 2025-10-06T18:15:38Z

Build succeeded.

blathers-crl bot added the O-community and X-blathers-untriaged labels Oct 3, 2025

yuzefovich removed the X-blathers-untriaged label Oct 3, 2025

kev-cao approved these changes Oct 3, 2025

kev-cao requested changes Oct 3, 2025

kev-cao reviewed Oct 3, 2025

util/retry: prioritize cancellations in retry loop

8c3abab

This commit teaches `util.Retry` to prioritize context cancellations and stoppers over retry attempts. This ensures more consistent behaviors and reduces test flakes.

Fixes: cockroachdb#154764

Release note: None

kev-cao force-pushed the fix-retry-test-timing-154764 branch from 7a49f1b to 8c3abab October 6, 2025 15:44

kev-cao changed the title ~~util/retry: increase timing tolerance for cancellation tests~~ util: prioritize cancellations in retry loop Oct 6, 2025

kev-cao approved these changes Oct 6, 2025

kev-cao requested review from dt and yuzefovich and removed request for dt October 6, 2025 15:48

yuzefovich approved these changes Oct 6, 2025

kev-cao added the backport-25.4.x label Oct 6, 2025

craig bot merged commit 52833a4 into cockroachdb:master Oct 6, 2025
27 checks passed

celeste-cockroachdb bot added the target-release-26.1.0 label Oct 6, 2025

blathers-crl bot mentioned this pull request Oct 6, 2025

release-25.4: util: prioritize cancellations in retry loop #154876

Open
