Skip to content

Conversation

sanki92
Copy link
Contributor

@sanki92 sanki92 commented Oct 3, 2025

This commit teaches util.Retry to prioritize context cancellations and stoppers over retry attempts. This ensures more consistent behaviors and reduces test flakes.

Fixes: #154764

Release note: None

Copy link

blathers-crl bot commented Oct 3, 2025

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

Before a member of our team reviews your PR, I have some potential action items for you:

  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

I was unable to automatically find a reviewer. You can try CCing one of the following members:

  • A person you worked with closely on this PR.
  • The person who created the ticket, or a CRDB organization member involved with the ticket (author, commenter, etc.).
  • Join our community slack channel and ask on #contributors.
  • Try find someone else from here.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-untriaged blathers was unable to find an owner labels Oct 3, 2025
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link

cockroachlabs-cla-agent bot commented Oct 3, 2025

CLA assistant check
All committers have signed the CLA.

@yuzefovich
Copy link
Member

Thanks for opening a PR! I'm not sure if the fix is quite right, I'll defer to @kev-cao who authored the test for that.

Generally, we don't expect community members to look into test failures, so I'd encourage you to take a look at #41815 as a starting point to find an interesting feature issue to work on.

@yuzefovich yuzefovich removed the X-blathers-untriaged blathers was unable to find an owner label Oct 3, 2025
Copy link
Contributor

@kev-cao kev-cao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM! While this test does use a manually controlled clock, these two particular subtests have some variability due to the select statement randomly picking between the stopped channels versus the clock's timer.

Increasing the timing tolerance like this will decrease the amount of flakes without impacting the correctness of the test, although to fully resolve the issue we'd probably want to look into repeatedly advancing the manual clock by some fraction of the backoff to avoid ties.

@sanki92
Copy link
Contributor Author

sanki92 commented Oct 3, 2025

@yuzefovich Thanks for the feedback! I appreciate @kev-cao confirming the fix approach.

I notice the CI failures seem to be infrastructure-related rather than issues with the code changes. Should I wait for these to resolve, or is there anything specific I should address?

Also, thank you for pointing me toward #41815 for feature work - I'll definitely explore those opportunities for more substantial contributions!

Copy link
Contributor

@kev-cao kev-cao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the CI issues were a little weird since those subtests were failing under duress despite the fact that they should be skipped under duress.

That being said, while I was investigating this, I realized that there is a better solution here that eliminates the need to skip running these tests under duress. In retry.Next, we currently perform a blocking select on the backoff timer, context cancellation, and stopper.

This does mean that with shorter backoffs, context cancellation/stoppers are not prioritized in the event that all three channels are ready. This is the reason why we are running into these flakes.

If we instead do a two-stage select, a non-blocking select on the context/stopper first before running our blocking select, this correctly prioritizes context cancellation/stops and also prevents these flakes entirely and we can delete the code for skipping under duress.

@sanki92 If you'd like to give this a try, you are more than welcome to. Otherwise I can put up a quick PR for it.

@sanki92
Copy link
Contributor Author

sanki92 commented Oct 3, 2025

@kev-cao I'd love to give this a try! Thank you for the detailed explanation of the two-stage select approach - it makes perfect sense and is a much more elegant solution than the timing tolerance bandaid.

I'll implement the non-blocking select for context/closer prioritization followed by the blocking select with timer, and remove the duress skipping logic as you described.

This is exactly the kind of meaningful contribution I was hoping to make. I appreciate you taking the time to mentor me through the proper solution!

Copy link

blathers-crl bot commented Oct 3, 2025

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@sanki92
Copy link
Contributor Author

sanki92 commented Oct 3, 2025

@kev-cao Implementation complete! Two-stage select with context/closer prioritization is now in place, and duress skipping logic has been removed. Ready for your review!

Copy link

blathers-crl bot commented Oct 3, 2025

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

// Under duress, closing a channel will not necessarily stop the retry
// loop immediately, so we skip this test under duress.
skipUnderDuress: true,
expectedTimeSpent: time.Millisecond,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the context/stopper is canceled before the retry loop, the expected time would be 0 instead of 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed ✅

Copy link

blathers-crl bot commented Oct 3, 2025

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.


if tc.expectedTimeSpent != 0 {
require.Equal(
t, tc.expectedTimeSpent, timeSource.Since(start), "expected time does not match actual spent time",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll also need to get rid of the condition or else the tests are essentially a no-op.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Let me know if you spot anything else.

Copy link

blathers-crl bot commented Oct 3, 2025

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@kev-cao
Copy link
Contributor

kev-cao commented Oct 3, 2025

Does the test still pass?

@sanki92
Copy link
Contributor Author

sanki92 commented Oct 3, 2025

Can't test locally - need Bazel build system for generated code. Logic should be correct though!

@kev-cao
Copy link
Contributor

kev-cao commented Oct 4, 2025

Hmm, you should be able to build following the instructions here. In any case, the test fails because the two subtests report that 1 millisecond has passed.

Copy link

blathers-crl bot commented Oct 4, 2025

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@sanki92
Copy link
Contributor Author

sanki92 commented Oct 4, 2025

@kev-cao Fixed the expected times back to 1ms. Based on your feedback that the test was failing because it reported 1ms elapsed (when we expected 0), I analyzed the code and realized the backingOffHook advances manual time even when cancellation is detected immediately.

Also my system isn't able to handle the build process properly to test locally, but the logic should be correct now.
image

@kev-cao
Copy link
Contributor

kev-cao commented Oct 4, 2025

I think the implementation needs updating. If the context is canceled before the retry loop ever runs, then no backoff will have been performed.

Copy link

blathers-crl bot commented Oct 4, 2025

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@sanki92
Copy link
Contributor Author

sanki92 commented Oct 4, 2025

I've made the changes. Could you check it locally when you get a chance? My setup can't handle the build. Thanks!

This commit teaches `util.Retry` to prioritize context cancellations and
stoppers over retry attempts. This ensures more consistent behaviors and
reduces test flakes.

Fixes: cockroachdb#154764

Release note: None
@kev-cao kev-cao force-pushed the fix-retry-test-timing-154764 branch from 7a49f1b to 8c3abab Compare October 6, 2025 15:44
@kev-cao kev-cao changed the title util/retry: increase timing tolerance for cancellation tests util: prioritize cancellations in retry loop Oct 6, 2025
Copy link
Contributor

@kev-cao kev-cao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM! Updated the commit message and PR to follow our conventions. I'll get one more set of eyes on this before we merge it. Thanks for the contribution!

@kev-cao kev-cao requested review from dt and yuzefovich and removed request for dt October 6, 2025 15:48
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Probably worth backporting to 25.4?

@yuzefovich reviewed 1 of 2 files at r3, 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kev-cao and @sanki92)

@kev-cao kev-cao added the backport-25.4.x Flags PRs that need to be backported to 25.4 label Oct 6, 2025
@kev-cao
Copy link
Contributor

kev-cao commented Oct 6, 2025

bors r=kev-cao,yuzefovich

@craig
Copy link
Contributor

craig bot commented Oct 6, 2025

@craig craig bot merged commit 52833a4 into cockroachdb:master Oct 6, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-25.4.x Flags PRs that need to be backported to 25.4 O-community Originated from the community target-release-26.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

util/retry: TestRetryWithMaxDuration failed
4 participants