Skip to content

Conversation

@nimrod-teich
Copy link
Contributor

@nimrod-teich nimrod-teich commented Jan 27, 2026

When quorum is disabled (no lava-quorum-* headers), RelayCountOnNodeError
now acts as a hard limit on the total number of retry batches. Setting
RelayCountOnNodeError=0 effectively disables all retries on node errors.

When quorum is enabled, quorumParams.Max is used as the retry limit instead,
allowing quorum functionality to work independently of RelayCountOnNodeError.

Additionally, introduces a new --set-retry-count-on-protocol-error CLI flag
that controls protocol error retries independently from node error retries.

Changes:

  • Modified retryCondition() in both smart router and consumer state machines
    to respect RelayCountOnNodeError when quorum is disabled
  • Updated HasRequiredNodeResults() to count node errors as "results" when
    RelayCountOnNodeError=0 to prevent quorum logic from triggering retries
  • Added HasRequiredNodeResults() return value for protocol error count
  • Update retryCondition to apply appropriate limit based on error type:
    • Only node errors: use RelayCountOnNodeError
    • Only protocol errors: use RelayCountOnProtocolError
    • Both: use max of both limits
  • Add RelayCountOnProtocolError variable and CLI flags
  • Added comprehensive tests for retry limit behavior

This allows users to disable node error retries (--set-retry-count-on-node-error=0)
while still allowing protocol error retries for transient connection issues.

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

@codecov
Copy link

codecov bot commented Jan 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
consensus 8.53% <ø> (ø)
protocol ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
protocol/common/cobra_common.go 0.00% <ø> (ø)

... and 186 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nimrod-teich nimrod-teich requested review from NadavLevi, Tomelia1999 and avitenzer and removed request for Tomelia1999 January 27, 2026 15:58
NadavLevi
NadavLevi previously approved these changes Jan 27, 2026
@github-actions
Copy link

github-actions bot commented Jan 27, 2026

Test Results

    6 files   -   1     83 suites   - 2   31m 5s ⏱️ -44s
3 031 tests  - 165  3 028 ✅  - 167  1 💤 ±0  0 ❌ ±0  2 🔥 +2 
3 114 runs   - 165  3 111 ✅  - 167  1 💤 ±0  0 ❌ ±0  2 🔥 +2 

For more details on these errors, see this check.

Results for commit e19729e. ± Comparison against base commit 2ae0fd6.

This pull request removes 177 and adds 12 tests. Note that renamed tests count towards both.
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier/Multiplier_of_1
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier/Multiplier_of_1.2
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier/Multiplier_of_2
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier/Multiplier_of_2.5
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier/Multiplier_of_200
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheFailSetWithInvalidRequestBlock
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheFailSetWithInvalidRequestBlock/Finalized_After_delay_No_Hash
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheFailSetWithInvalidRequestBlock/Finalized_After_delay_No_Hash#01
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheFailSetWithInvalidRequestBlock/Finalized_After_delay_With_Hash
…
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroBatchRequest
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroBatchRequest/Batch_request_with_node_error_SHOULD_retry_when_RelayCountOnNodeError=2
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroBatchRequest/Batch_request_with_node_error_should_not_retry_when_RelayCountOnNodeError=0
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries/RelayCountOnNodeError=0_with_multiple_node_errors_should_NOT_retry
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries/RelayCountOnNodeError=0_with_only_node_errors_should_NOT_retry
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries/RelayCountOnNodeError=0_with_success_should_NOT_retry
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries/RelayCountOnNodeError=2_with_only_node_errors_SHOULD_retry
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries/RelayCountOnNodeError=2_with_success_should_NOT_retry
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnProtocolErrorSeparateFromNodeError
…

♻️ This comment has been updated with latest results.

@nimrod-teich nimrod-teich force-pushed the fix/relay-count-on-node-error-retry-limit branch from b1fa4d2 to 6a7f5b3 Compare January 28, 2026 08:57
@nimrod-teich nimrod-teich force-pushed the fix/relay-count-on-node-error-retry-limit branch 2 times, most recently from 4426b64 to 3844817 Compare January 28, 2026 13:50
@nimrod-teich nimrod-teich changed the title fix(relay): enforce RelayCountOnNodeError as hard retry limit feat(relay): enforce RelayCountOnNodeError and add separate protocol error retries Jan 28, 2026
…error retries

When quorum is disabled (no lava-quorum-* headers), RelayCountOnNodeError
now acts as a hard limit on the total number of retry batches. Setting
RelayCountOnNodeError=0 effectively disables all retries on node errors.

When quorum is enabled, quorumParams.Max is used as the retry limit instead,
allowing quorum functionality to work independently of RelayCountOnNodeError.

Additionally, introduces a new `--set-retry-count-on-protocol-error` CLI flag
that controls protocol error retries independently from node error retries.

Changes:
- Modified retryCondition() in both smart router and consumer state machines
  to respect RelayCountOnNodeError when quorum is disabled
- Updated HasRequiredNodeResults() to count node errors as "results" when
  RelayCountOnNodeError=0 to prevent quorum logic from triggering retries
- Added HasRequiredNodeResults() return value for protocol error count
- Update retryCondition to apply appropriate limit based on error type:
  - Only node errors: use RelayCountOnNodeError
  - Only protocol errors: use RelayCountOnProtocolError
  - Both: use max of both limits
- Add RelayCountOnProtocolError variable and CLI flags
- Added comprehensive tests for retry limit behavior

This allows users to disable node error retries (--set-retry-count-on-node-error=0)
while still allowing protocol error retries for transient connection issues.
@nimrod-teich nimrod-teich force-pushed the fix/relay-count-on-node-error-retry-limit branch from 3844817 to 317904a Compare January 28, 2026 13:52
// Quorum feature disabled: check if we have enough results for quorum
retryForQuorumNeeded = !(resultsCount >= neededForQuorum)
// When RelayCountOnNodeError is 0, treat node errors as "results" for quorum purposes
// This ensures that when retries are disabled, we don't retry to replace node errors
Copy link
Contributor

@Tomelia1999 Tomelia1999 Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will still be retry on nodeErrors.
Imagine the following case:
RelayCountOnNodeError = 2

it will go to the ELSE and won't take into account how much nodeErrors there are, only how much resultsCount there are

Updated the retryCondition function in both ConsumerRelayStateMachine and SmartRouterRelayStateMachine to streamline the handling of retry limits. Removed redundant checks for both node and protocol errors, allowing the maximum of both limits to be used in cases of either error type or no errors. This change enhances clarity and maintains the intended functionality of retry behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants