feat(relay): enforce RelayCountOnNodeError and add separate protocol error retries #2201

nimrod-teich · 2026-01-27T15:45:25Z

When quorum is disabled (no lava-quorum-* headers), RelayCountOnNodeError
now acts as a hard limit on the total number of retry batches. Setting
RelayCountOnNodeError=0 effectively disables all retries on node errors.

When quorum is enabled, quorumParams.Max is used as the retry limit instead,
allowing quorum functionality to work independently of RelayCountOnNodeError.

Additionally, introduces a new --set-retry-count-on-protocol-error CLI flag
that controls protocol error retries independently from node error retries.

Changes:

Modified retryCondition() in both smart router and consumer state machines
to respect RelayCountOnNodeError when quorum is disabled
Updated HasRequiredNodeResults() to count node errors as "results" when
RelayCountOnNodeError=0 to prevent quorum logic from triggering retries
Added HasRequiredNodeResults() return value for protocol error count
Update retryCondition to apply appropriate limit based on error type:
- Only node errors: use RelayCountOnNodeError
- Only protocol errors: use RelayCountOnProtocolError
- Both: use max of both limits
Add RelayCountOnProtocolError variable and CLI flags
Added comprehensive tests for retry limit behavior

This allows users to disable node error retries (--set-retry-count-on-node-error=0)
while still allowing protocol error retries for transient connection issues.

Description

Closes: #XXXX

Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

read the contribution guide
included the correct type prefix in the PR title, you can find examples of the prefixes below:
confirmed ! in the type prefix if API or client breaking change
targeted the main branch
provided a link to the relevant issue or specification
reviewed "Files changed" and left comments if necessary
included the necessary unit and integration tests
updated the relevant documentation or specification, including comments for documenting Go code
confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

confirmed the correct type prefix in the PR title
confirmed all author checklist items have been addressed
reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

codecov · 2026-01-27T15:52:09Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag	Coverage Δ
consensus	`8.53% <ø> (ø)`
protocol	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
protocol/common/cobra_common.go	`0.00% <ø> (ø)`

... and 186 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2026-01-27T16:04:20Z

Test Results

6 files - 1 83 suites - 2 31m 5s ⏱️ -44s
3 031 tests - 165 3 028 ✅ - 167 1 💤 ±0 0 ❌ ±0 2 🔥 +2
3 114 runs - 165 3 111 ✅ - 167 1 💤 ±0 0 ❌ ±0 2 🔥 +2

For more details on these errors, see this check.

Results for commit e19729e. ± Comparison against base commit 2ae0fd6.

This pull request removes 177 and adds 12 tests. Note that renamed tests count towards both.

github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier/Multiplier_of_1
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier/Multiplier_of_1.2
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier/Multiplier_of_2
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier/Multiplier_of_2.5
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheExpirationMultiplier/Multiplier_of_200
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheFailSetWithInvalidRequestBlock
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheFailSetWithInvalidRequestBlock/Finalized_After_delay_No_Hash
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheFailSetWithInvalidRequestBlock/Finalized_After_delay_No_Hash#01
github.com/lavanet/lava/v5/ecosystem/cache ‑ TestCacheFailSetWithInvalidRequestBlock/Finalized_After_delay_With_Hash
…

github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroBatchRequest
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroBatchRequest/Batch_request_with_node_error_SHOULD_retry_when_RelayCountOnNodeError=2
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroBatchRequest/Batch_request_with_node_error_should_not_retry_when_RelayCountOnNodeError=0
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries/RelayCountOnNodeError=0_with_multiple_node_errors_should_NOT_retry
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries/RelayCountOnNodeError=0_with_only_node_errors_should_NOT_retry
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries/RelayCountOnNodeError=0_with_success_should_NOT_retry
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries/RelayCountOnNodeError=2_with_only_node_errors_SHOULD_retry
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnNodeErrorZeroDisablesRetries/RelayCountOnNodeError=2_with_success_should_NOT_retry
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestRelayCountOnProtocolErrorSeparateFromNodeError
…

♻️ This comment has been updated with latest results.

…error retries When quorum is disabled (no lava-quorum-* headers), RelayCountOnNodeError now acts as a hard limit on the total number of retry batches. Setting RelayCountOnNodeError=0 effectively disables all retries on node errors. When quorum is enabled, quorumParams.Max is used as the retry limit instead, allowing quorum functionality to work independently of RelayCountOnNodeError. Additionally, introduces a new `--set-retry-count-on-protocol-error` CLI flag that controls protocol error retries independently from node error retries. Changes: - Modified retryCondition() in both smart router and consumer state machines to respect RelayCountOnNodeError when quorum is disabled - Updated HasRequiredNodeResults() to count node errors as "results" when RelayCountOnNodeError=0 to prevent quorum logic from triggering retries - Added HasRequiredNodeResults() return value for protocol error count - Update retryCondition to apply appropriate limit based on error type: - Only node errors: use RelayCountOnNodeError - Only protocol errors: use RelayCountOnProtocolError - Both: use max of both limits - Add RelayCountOnProtocolError variable and CLI flags - Added comprehensive tests for retry limit behavior This allows users to disable node error retries (--set-retry-count-on-node-error=0) while still allowing protocol error retries for transient connection issues.

Tomelia1999 · 2026-01-28T14:35:53Z

protocol/relaycore/relay_processor.go

 			// Quorum feature disabled: check if we have enough results for quorum
-			retryForQuorumNeeded = !(resultsCount >= neededForQuorum)
+			// When RelayCountOnNodeError is 0, treat node errors as "results" for quorum purposes
+			// This ensures that when retries are disabled, we don't retry to replace node errors


There will still be retry on nodeErrors.
Imagine the following case:
RelayCountOnNodeError = 2

it will go to the ELSE and won't take into account how much nodeErrors there are, only how much resultsCount there are

Updated the retryCondition function in both ConsumerRelayStateMachine and SmartRouterRelayStateMachine to streamline the handling of retry limits. Removed redundant checks for both node and protocol errors, allowing the maximum of both limits to be used in cases of either error type or no errors. This change enhances clarity and maintains the intended functionality of retry behavior.

pull-request-size bot added the size/XL label Jan 27, 2026

github-actions bot added C:protocol Team:Protocol labels Jan 27, 2026

nimrod-teich requested review from NadavLevi, Tomelia1999 and avitenzer and removed request for Tomelia1999 January 27, 2026 15:58

NadavLevi previously approved these changes Jan 27, 2026

View reviewed changes

nimrod-teich force-pushed the fix/relay-count-on-node-error-retry-limit branch from b1fa4d2 to 6a7f5b3 Compare January 28, 2026 08:57

nimrod-teich dismissed NadavLevi’s stale review via 4426b64 January 28, 2026 12:26

nimrod-teich force-pushed the fix/relay-count-on-node-error-retry-limit branch 2 times, most recently from 4426b64 to 3844817 Compare January 28, 2026 13:50

pull-request-size bot added size/XXL and removed size/XL labels Jan 28, 2026

nimrod-teich changed the title ~~fix(relay): enforce RelayCountOnNodeError as hard retry limit~~ feat(relay): enforce RelayCountOnNodeError and add separate protocol error retries Jan 28, 2026

nimrod-teich force-pushed the fix/relay-count-on-node-error-retry-limit branch from 3844817 to 317904a Compare January 28, 2026 13:52

pull-request-size bot added size/XL and removed size/XXL labels Jan 28, 2026

Tomelia1999 reviewed Jan 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(relay): enforce RelayCountOnNodeError and add separate protocol error retries #2201

feat(relay): enforce RelayCountOnNodeError and add separate protocol error retries #2201

nimrod-teich commented Jan 27, 2026 •

edited

Loading

Uh oh!

codecov bot commented Jan 27, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Jan 27, 2026 •

edited

Loading

Uh oh!

Tomelia1999 Jan 28, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

feat(relay): enforce RelayCountOnNodeError and add separate protocol error retries #2201

Are you sure you want to change the base?

feat(relay): enforce RelayCountOnNodeError and add separate protocol error retries #2201

Conversation

nimrod-teich commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Author Checklist

Reviewers Checklist

Uh oh!

codecov bot commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

Uh oh!

Tomelia1999 Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

nimrod-teich commented Jan 27, 2026 •

edited

Loading

codecov bot commented Jan 27, 2026 •

edited

Loading

github-actions bot commented Jan 27, 2026 •

edited

Loading

Tomelia1999 Jan 28, 2026 •

edited

Loading