feat(relay): enforce RelayCountOnNodeError and add separate protocol error retries #2201
base: main
Codecov Report
✅ All modified and coverable lines are covered by tests.
... and 186 files with indirect coverage changes
Test Results
6 files (-1), 83 suites (-2), 31m 5s (-44s)
Results for commit e19729e. ± Comparison against base commit 2ae0fd6.
This pull request removes 177 and adds 12 tests. Note that renamed tests count towards both.
b1fa4d2 to 6a7f5b3
4426b64 to 3844817
…error retries

When quorum is disabled (no lava-quorum-* headers), RelayCountOnNodeError now acts as a hard limit on the total number of retry batches. Setting RelayCountOnNodeError=0 effectively disables all retries on node errors.

When quorum is enabled, quorumParams.Max is used as the retry limit instead, allowing quorum functionality to work independently of RelayCountOnNodeError.

Additionally, introduces a new `--set-retry-count-on-protocol-error` CLI flag that controls protocol error retries independently from node error retries.

Changes:
- Modified retryCondition() in both smart router and consumer state machines to respect RelayCountOnNodeError when quorum is disabled
- Updated HasRequiredNodeResults() to count node errors as "results" when RelayCountOnNodeError=0 to prevent quorum logic from triggering retries
- Added HasRequiredNodeResults() return value for protocol error count
- Update retryCondition to apply appropriate limit based on error type:
  - Only node errors: use RelayCountOnNodeError
  - Only protocol errors: use RelayCountOnProtocolError
  - Both: use max of both limits
- Add RelayCountOnProtocolError variable and CLI flags
- Added comprehensive tests for retry limit behavior

This allows users to disable node error retries (--set-retry-count-on-node-error=0) while still allowing protocol error retries for transient connection issues.
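The limit selection described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `retryLimit`, `shouldRetry`, and `maxInt` are hypothetical names, the "no errors seen" case falls back to the max of both limits as an assumption, and comparing the number of retry batches against the limit with `<` is also an assumption.

```go
package main

import "fmt"

// maxInt returns the larger of two ints. It is spelled out so the sketch
// also builds on Go versions that lack the built-in max.
func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// retryLimit is a hypothetical helper (not a function from the PR).
// It picks the retry-batch limit following the rules in the commit message.
func retryLimit(quorumEnabled bool, quorumMax, nodeErrLimit, protocolErrLimit, nodeErrors, protocolErrors int) int {
	if quorumEnabled {
		// With quorum enabled, the quorum maximum is the limit.
		return quorumMax
	}
	switch {
	case nodeErrors > 0 && protocolErrors == 0:
		return nodeErrLimit // only node errors
	case protocolErrors > 0 && nodeErrors == 0:
		return protocolErrLimit // only protocol errors
	default:
		// Both error types: max of both limits (per the commit message).
		// No errors at all: also max of both, which is an assumption here.
		return maxInt(nodeErrLimit, protocolErrLimit)
	}
}

// shouldRetry is a hypothetical check: retry only while fewer retry
// batches than the limit have been sent, so a limit of 0 never retries.
func shouldRetry(retryBatchesSoFar, limit int) bool {
	return retryBatchesSoFar < limit
}

func main() {
	// Node-error retries disabled (0), protocol-error retries allowed (3).
	fmt.Println(retryLimit(false, 5, 0, 3, 1, 0)) // only node errors: 0
	fmt.Println(retryLimit(false, 5, 0, 3, 0, 1)) // only protocol errors: 3
	fmt.Println(retryLimit(false, 5, 0, 3, 1, 1)) // both: 3
	fmt.Println(retryLimit(true, 5, 0, 3, 1, 0))  // quorum enabled: 5
	fmt.Println(shouldRetry(0, 0))                // false
}
```

With a node error limit of 0 and only node errors observed, the sketch never retries, which matches the stated goal of letting `--set-retry-count-on-node-error=0` disable node error retries while protocol error retries stay available.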
3844817 to 317904a
// Quorum feature disabled: check if we have enough results for quorum
retryForQuorumNeeded = !(resultsCount >= neededForQuorum)
// When RelayCountOnNodeError is 0, treat node errors as "results" for quorum purposes
// This ensures that when retries are disabled, we don't retry to replace node errors
There will still be a retry on node errors.
Imagine the following case:
RelayCountOnNodeError = 2
It will go to the else branch, which does not take into account how many node errors there are, only how many results there are (resultsCount).
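The reviewer's scenario can be made concrete with a small sketch. It is reconstructed only from the diff fragment and the comment above, so the function name, its parameters, and the exact shape of the branches are assumptions rather than the PR's actual code.

```go
package main

import "fmt"

// retryForQuorumNeeded is a hypothetical reconstruction of the check under
// review. When relayCountOnNodeError is 0, node errors count as results;
// otherwise (the else branch) only resultsCount is compared.
func retryForQuorumNeeded(resultsCount, nodeErrors, neededForQuorum, relayCountOnNodeError int) bool {
	if relayCountOnNodeError == 0 {
		// Retries disabled: a node error is accepted as a "result".
		return !(resultsCount+nodeErrors >= neededForQuorum)
	}
	// Else branch: the number of node errors is never consulted here,
	// which is the behavior the review comment points out.
	return !(resultsCount >= neededForQuorum)
}

func main() {
	// Reviewer's case: RelayCountOnNodeError = 2, no successful results,
	// and already more node errors (3) than the configured limit.
	fmt.Println(retryForQuorumNeeded(0, 3, 1, 2)) // true: a retry is still requested
	// Same responses with RelayCountOnNodeError = 0.
	fmt.Println(retryForQuorumNeeded(0, 3, 1, 0)) // false: node errors count as results
}
```

In this sketch the first call keeps asking for a retry no matter how many node errors have accumulated, so any limit other than 0 would have to be enforced somewhere else.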
Updated the retryCondition function in both ConsumerRelayStateMachine and SmartRouterRelayStateMachine to streamline the handling of retry limits. Removed redundant checks for both node and protocol errors, allowing the maximum of both limits to be used in cases of either error type or no errors. This change enhances clarity and maintains the intended functionality of retry behavior.
When quorum is disabled (no lava-quorum-* headers), RelayCountOnNodeError
now acts as a hard limit on the total number of retry batches. Setting
RelayCountOnNodeError=0 effectively disables all retries on node errors.
When quorum is enabled, quorumParams.Max is used as the retry limit instead,
allowing quorum functionality to work independently of RelayCountOnNodeError.
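Additionally, introduces a new `--set-retry-count-on-protocol-error` CLI flag that controls protocol error retries independently from node error retries.
Changes:
- Modified retryCondition() in both smart router and consumer state machines to respect RelayCountOnNodeError when quorum is disabled
- Updated HasRequiredNodeResults() to count node errors as "results" when RelayCountOnNodeError=0 to prevent quorum logic from triggering retries
- Added HasRequiredNodeResults() return value for protocol error count
- Update retryCondition to apply appropriate limit based on error type:
  - Only node errors: use RelayCountOnNodeError
  - Only protocol errors: use RelayCountOnProtocolError
  - Both: use max of both limits
- Add RelayCountOnProtocolError variable and CLI flags
- Added comprehensive tests for retry limit behavior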
This allows users to disable node error retries (--set-retry-count-on-node-error=0)
while still allowing protocol error retries for transient connection issues.
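As an illustration of how the two flags could be wired to independent settings, here is a minimal sketch using Go's standard `flag` package. The flag names come from this description; the `retryConfig` type, the `parseRetryFlags` helper, the default value of 2, and the use of the standard library rather than the project's own CLI setup are all assumptions.

```go
package main

import (
	"flag"
	"fmt"
)

// retryConfig is a hypothetical holder for the two independent limits.
type retryConfig struct {
	RelayCountOnNodeError     int
	RelayCountOnProtocolError int
}

// parseRetryFlags parses the two retry flags from args.
// The default of 2 is a placeholder, not the project's real default.
func parseRetryFlags(args []string) (retryConfig, error) {
	var cfg retryConfig
	fs := flag.NewFlagSet("retry-sketch", flag.ContinueOnError)
	fs.IntVar(&cfg.RelayCountOnNodeError, "set-retry-count-on-node-error", 2,
		"maximum retry batches on node errors (0 disables them)")
	fs.IntVar(&cfg.RelayCountOnProtocolError, "set-retry-count-on-protocol-error", 2,
		"maximum retry batches on protocol errors")
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	// Disable node error retries while keeping protocol error retries.
	cfg, err := parseRetryFlags([]string{
		"--set-retry-count-on-node-error=0",
		"--set-retry-count-on-protocol-error=3",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("node=%d protocol=%d\n", cfg.RelayCountOnNodeError, cfg.RelayCountOnProtocolError)
	// prints "node=0 protocol=3"
}
```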
Description
Closes: #XXXX
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
`!` in the type prefix if API or client breaking change
`main` branch
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...