Skip to content

Conversation

tnull
Copy link
Contributor

@tnull tnull commented Aug 19, 2025

Based on #38.

We enable reqwest client-level timeouts:

While the `RetryPolicy` has a `MaxTotalDelayRetryPolicy`, the retry
`loop` would only check this configured delay once the operation future
actually returns a value. However, without client-side timeouts, we're
not super sure the operation is actually guaranteed to return anything
(even an error, IIUC).

So here, we enable some coarse client-side default timeouts to ensure
the polled futures eventualy return either the response *or* an error we
can handle via our retry logic.

One aspect that might be debatable here is whether we should drop MaxTotalDelayRetryPolicy given it would interact with the client-side default timeout. Hence also pinging @jkczyz who reviewed the original retry PR.

tnull added 4 commits August 19, 2025 11:42
.. to avoid the warning
While the `RetryPolicy` has a `MaxTotalDelayRetryPolicy`, the retry
`loop` would only check this configured delay once the operation future
actually returns a value. However, without client-side timeouts, we're
not super sure the operation is actually guaranteed to return anything
(even an error, IIUC).

So here, we enable some coarse client-side default timeouts to ensure
the polled futures eventualy return either the response *or* an error we
can handle via our retry logic.
@tnull tnull requested review from jkczyz and tankyleo August 19, 2025 09:45
@ldk-reviews-bot
Copy link

ldk-reviews-bot commented Aug 19, 2025

👋 Thanks for assigning @tankyleo as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@tnull tnull changed the title 2025 08 enable client side delays Enable client-side delays Aug 19, 2025
@tnull tnull changed the title Enable client-side delays Enable client-side timeouts Aug 19, 2025
Comment on lines +36 to +38
.timeout(DEFAULT_TIMEOUT)
.connect_timeout(DEFAULT_TIMEOUT)
.read_timeout(DEFAULT_TIMEOUT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you seems like we could just do with the single global timeout here, no need for connect and read ?

But we can leave it as is and potentially tweak the inner timeouts later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, no strong opinion here.

@tankyleo
Copy link
Contributor

One aspect that might be debatable here is whether we should drop MaxTotalDelayRetryPolicy given it would interact with the client-side default delay. Hence also pinging @jkczyz who reviewed the original retry PR.

For reference original PR is #20. On first impression, I'd be in favor of the drop myself.

@jkczyz
Copy link

jkczyz commented Aug 20, 2025

One aspect that might be debatable here is whether we should drop MaxTotalDelayRetryPolicy given it would interact with the client-side default delay. Hence also pinging @jkczyz who reviewed the original retry PR.

For reference original PR is #20. On first impression, I'd be in favor of the drop myself.

Do the added timeouts apply to a single operation? If it is never exceeded, wouldn't we still want MaxTotalDelayRetryPolicy to allow limiting retries to a maximum amount of time?

What is meant by client-side default delay?

@tnull
Copy link
Contributor Author

tnull commented Aug 20, 2025

Do the added timeouts apply to a single operation?

Yes, they apply for a single read, but also for connecting / detecting dropped connections AFAIU.

If it is never exceeded, wouldn't we still want MaxTotalDelayRetryPolicy to allow limiting retries to a maximum amount of time?

Yes, it could be useful, but of course its somewhat redundant if we set a client-side timeout and limit the number of retries. It could therefore be a bit confusing if somebody configures the MaxTotalDelayRetryPolicy, but still the other total delay applies if its lesser (i.e., number of retries times timeout).

What is meant by client-side default delay?

Ah, sorry, that was a typo I only corrected in the PR title: should have said default timeout, not delay.

@jkczyz
Copy link

jkczyz commented Aug 20, 2025

Yes, it could be useful, but of course its somewhat redundant if we set a client-side timeout and limit the number of retries.

Do you mean MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>>?

It could therefore be a bit confusing if somebody configures the MaxTotalDelayRetryPolicy, but still the other total delay applies if its lesser (i.e., number of retries times timeout).

Isn't this already the case when configured as I mentioned above? Number of attempts takes priority over total delay given the way MaxTotalDelayRetryPolicy is written.

Maybe I'm confused about what is lesser in that example.

@tnull
Copy link
Contributor Author

tnull commented Aug 21, 2025

Do you mean MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>>?

Yes, if each client call is also limited by a timeout, then we'd have either timeout*MaxAttemptsRetryPolicy or MaxTotalDelayRetryPolicy being the limiting factor.

Maybe I'm confused about what is lesser in that example.

Say you configure MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>> with 5 retries and a total delay of 100 seconds (for the sake of this example). Then you'd expect the client to return either in case of success or after it tried 5 times or after 100s, whatever comes first. Now with the client-side timeouts we also have each retry timeout after 10 seconds, so it might already be done after 50s.

Or, maybe even a bit more confusing would be if the user configured a MaxTotalDelayRetryPolicy of less than 10s, say 5s. They would expect a client call def. return after that 5s. But, if we have a client-side timeout of 10s (or none before this PR), even the first attempt could take way longer than the configured total delay (since we don't use a select but rather a loop, we always await the first/current operation to return).

@ldk-reviews-bot
Copy link

🔔 1st Reminder

Hey @jkczyz! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

@jkczyz
Copy link

jkczyz commented Aug 21, 2025

Say you configure MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>> with 5 retries and a total delay of 100 seconds (for the sake of this example). Then you'd expect the client to return either in case of success or after it tried 5 times or after 100s, whatever comes first. Now with the client-side timeouts we also have each retry timeout after 10 seconds, so it might already be done after 50s.

Isn't that expected? "done after 50s" is really "done after 5 attempts".

Or, maybe even a bit more confusing would be if the user configured a MaxTotalDelayRetryPolicy of less than 10s, say 5s. They would expect a client call def. return after that 5s. But, if we have a client-side timeout of 10s (or none before this PR), even the first attempt could take way longer than the configured total delay (since we don't use a select but rather a loop, we always await the first/current operation to return).

Yeah, though isn't that a good argument to use select? Or does the current design not allow that given policy timeout is built into the type rather than the calling site being aware of it?

@tnull
Copy link
Contributor Author

tnull commented Aug 21, 2025

Yeah, though isn't that a good argument to use select? Or does the current design not allow that given policy timeout is built into the type rather than the calling site being aware of it?

True, seems like we should? And given that we already use tokio with the time feature, it should be straightforward. I think I'll add a commit to this PR.

@tnull
Copy link
Contributor Author

tnull commented Aug 22, 2025

Yeah, though isn't that a good argument to use select? Or does the current design not allow that given policy timeout is built into the type rather than the calling site being aware of it?

True, seems like we should? And given that we already use tokio with the time feature, it should be straightforward. I think I'll add a commit to this PR.

Argh, after looking into it for a bit I have to eat my words: it's actually not trivial, as currently RetryPolicy::next_delay requires us to supply the returned error, i.e., we can only calculate the next delay based on the error type (as we use it in FilteredRetryPolicy).

And more generally, with a generic error type, we don't know what error we'd return in case the the timeout happens before the operation future resolves.

@jkczyz
Copy link

jkczyz commented Aug 22, 2025

And more generally, with a generic error type, we don't know what error we'd return in case the the timeout happens before the operation future resolves.

Would it help defining an enum parameterized by the error E where one variant is for a timeout and the other for wrapping E?

@ldk-reviews-bot
Copy link

🔔 2nd Reminder

Hey @jkczyz! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

@ldk-reviews-bot
Copy link

🔔 3rd Reminder

Hey @jkczyz! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

@ldk-reviews-bot
Copy link

🔔 4th Reminder

Hey @jkczyz! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants