fix: Add retry configuration and warning for GEval score=None bug #216
x86girl wants to merge 1 commit into lightspeed-core:main
Conversation
CRITICAL BUG FIX: Prevent silent score mismatch when LLM judge fails
## Problem Statement
GEval metrics silently convert None scores to 0.0 without warning when
LLM judge evaluation fails. This masks critical infrastructure issues:
- Rate limiting (429 errors) after retries exhausted
- LLM provider timeouts
- Invalid JSON responses from judge
- API quota/credits exhausted
- Network connection failures
Without this fix, failed evaluations appear as valid low scores (0.0),
making debugging nearly impossible and corrupting evaluation data.
## What Changed
### 1. Added configurable retry logic (geval.py)
- New `num_retries` parameter from user config (default: 6)
- Custom retry decorator respecting user settings
- DeepEval hardcodes MAX_RETRIES=6, now we add our own layer
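The retry layer described above can be sketched in plain Python. Note this is an illustrative sketch, not the PR's actual code: the real implementation uses tenacity decorators, and the function name and parameters here are hypothetical.

```python
import random
import time


def measure_with_retry(measure_fn, num_retries=6, base_delay=1.0):
    """Attempt measure_fn up to num_retries times; any Exception triggers a retry.

    Mirrors the broad retry policy described in the PR (DeepEval retries on
    all exceptions). Backoff is exponential with jitter between attempts.
    """
    last_exc = None
    for attempt in range(num_retries):
        try:
            return measure_fn()
        except Exception as exc:  # broad on purpose, matching DeepEval's behavior
            last_exc = exc
            if attempt < num_retries - 1:
                # exponential backoff with jitter to spread out retries
                delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
                time.sleep(delay)
    raise last_exc
```

With tenacity, the same policy would be expressed via `Retrying(stop=stop_after_attempt(num_retries), ...)`, which lets `num_retries` stay a runtime value from the user config.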
### 2. Added warning when score=None (geval.py:385-397, 515-527)
Before:
```python
score = metric.score if metric.score is not None else 0.0 # Silent!
return score, reason
```
After:
```python
score = metric.score
if score is None:
    logger.warning(
        "GEval metric returned None score; defaulting to 0.0. "
        "This typically indicates LLM judge failure (rate limiting, "
        "timeout, invalid JSON, or quota exhausted). Reason: %s",
        reason,
    )
    score = 0.0
return score, reason
```
### 3. Updated system.yaml documentation
Added comment explaining num_retries applies to GEval and defaults to 6.
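As a sketch, the documented entry might look like the fragment below; the value 3 and the surrounding key layout are illustrative, not taken from the actual file.

```yaml
llm:
  # Retry attempts for the LLM judge (applies to GEval).
  # If omitted, the code falls back to 6 (matching DeepEval's hardcoded MAX_RETRIES).
  num_retries: 3
```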
### 4. Fixed test mocks
Updated test fixtures to include llm_params with num_retries to match
new initialization logic.
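A hedged sketch of the kind of fixture update this implies; the helper name and attribute shape follow the PR description, and the real fixture likely differs.

```python
from unittest.mock import MagicMock


def make_mock_llm_manager(num_retries=6):
    """Build an LLM-manager stub whose llm_params carries num_retries,
    matching the initialization path GEval now reads from."""
    manager = MagicMock()
    manager.llm_params = {"num_retries": num_retries}
    return manager
```

In a pytest suite this would typically be wrapped in a `@pytest.fixture` and injected into the GEval handler under test.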
## Files Changed
- src/lightspeed_evaluation/core/metrics/geval.py
  - Add tenacity imports for retry logic
  - Add num_retries initialization from config
  - Add _is_retryable_exception() method
  - Add _measure_with_retry() method
  - Add warning logs for None scores (turn & conversation level)
  - Replace direct metric.measure() with retry wrapper
- config/system.yaml
  - Document num_retries behavior for GEval
- tests/unit/core/metrics/test_geval.py
  - Add llm_params mock to fixture
## Testing
- All 833 tests pass
- All pre-commit checks pass (black, pylint, ruff, mypy, bandit, pyright)
- Manual verification of retry logic and warning logs
## Impact
Users can now:
- Configure retry attempts via system.yaml (was hardcoded to 6)
- Immediately see warnings when evaluations fail
- Distinguish real low scores from failed evaluations
- Debug rate limiting and infrastructure issues
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Walkthrough: This change introduces retry logic using tenacity to GEval metric evaluation, adds handling for None scores (logging warnings and converting to 0.0), updates configuration comments, and includes a test file to reproduce the GEval score mismatch scenario.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes. Pre-merge checks: ✅ 3 passed.
🧹 Nitpick comments (5)
config/system.yaml (1)
Line 30: Clarify the comment to avoid confusion about the default value. The comment says "default: 6 if not specified", which is technically correct (the code defaults to 6 when the key is absent), but could confuse users since the configured value here is 3. Users might wonder whether 3 or 6 is actually used. Consider rewording to clarify the distinction:

📝 Suggested clarification

```diff
-  num_retries: 3  # Retry attempts for LLM judge (applies to GEval, custom metrics, etc.; default: 6 if not specified)
+  num_retries: 3  # Retry attempts for LLM judge (applies to GEval, custom metrics, etc.). Code defaults to 6 when omitted.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/system.yaml` at line 30: the inline comment for the num_retries setting is ambiguous about the configured value versus the fallback; update the comment to clearly state that the current configured value is 3 and that 6 is the application default used only when the key is omitted (e.g., "num_retries: 3 # Number of retry attempts; currently set to 3. If this key is absent, the system defaults to 6."). Ensure the change targets the num_retries entry so readers understand the distinction between the explicit setting and the implicit default.

src/lightspeed_evaluation/core/metrics/geval.py (2)
Lines 217-235: Docstring doesn't match implementation: all exceptions are retried. The docstring lists specific "retryable conditions" (rate limiting, timeouts, network failures), but the implementation unconditionally returns True for any Exception. While the comment on lines 232-234 explains this matches DeepEval's behavior, the docstring creates misleading expectations. Consider updating the docstring to accurately reflect the behavior:

📝 Suggested fix

```diff
 def _is_retryable_exception(self, exception: BaseException) -> bool:
     """Check if exception should trigger a retry.

-    Retryable conditions:
-    - Rate limiting (429 errors from LLM provider)
-    - Timeout errors
-    - Temporary network failures
-    - LLM provider temporary errors
+    This method returns True for all Exception subclasses to match
+    DeepEval's hardcoded behavior (retryable_exceptions = (Exception,)).
+
+    Common retryable scenarios include rate limiting, timeouts, and
+    network failures, but we retry broadly to handle unknown transient
+    errors from various LLM providers.

     Args:
         exception: The exception to check
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/core/metrics/geval.py` around lines 217 - 235, Update the _is_retryable_exception method docstring to reflect actual behavior: state explicitly that it returns True for any Exception (i.e., it retries on all Exception subclasses), remove or change the specific "Retryable conditions" list to note that DeepEval/LiteLLM handles specific error types internally and that this function mirrors their hardcoded behavior, and adjust the Returns section to say it always returns True for Exception objects; reference _is_retryable_exception and DeepEval/LiteLLM in the text for clarity.
Lines 67-70: Consider adding explicit type annotation for num_retries. The num_retries attribute is inferred from .get(), which returns Any. For clarity and type safety, consider adding an explicit annotation:

📝 Suggested fix
```diff
 # Get num_retries from LLM configuration (default: 6 to match DeepEval's hardcoded value)
 # Note: DeepEval's internal retry logic uses hardcoded MAX_RETRIES=6,
 # but we add our own retry layer to respect user configuration
-self.num_retries = self.deepeval_llm_manager.llm_params.get("num_retries", 6)
+self.num_retries: int = self.deepeval_llm_manager.llm_params.get("num_retries", 6)
```

As per coding guidelines, "Include type hints for all public functions and methods in Python"; while this is an instance attribute, explicit typing improves maintainability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/core/metrics/geval.py` around lines 67 - 70: the attribute self.num_retries is inferred as Any from deepeval_llm_manager.llm_params.get; add an explicit int type and ensure the value is cast/validated. Change the assignment to annotate and coerce the value, e.g. set self.num_retries: int = int(self.deepeval_llm_manager.llm_params.get("num_retries", 6)) (or perform a safe int conversion/validation before assigning) so num_retries is explicitly typed as int in the class/constructor where it is defined.

bug_test_geval_score_mismatch.yaml (1)
Lines 1-11: Update comments to reflect the fix is now in place. The header comments describe the bug as "ACTUAL BEHAVIOR: Silent conversion from None -> 0.0" and "NO WARNING will be shown". However, with this PR merged, warnings WILL be shown. Consider updating the comments to indicate this file now serves as a regression test rather than a bug reproduction:
📝 Suggested update
```diff
-# BUG REPRODUCTION: GEval Score Mismatch Without Warning
-# This configuration demonstrates the bug where GEval metrics silently default
-# None scores to 0.0 without any warning or logging.
+# REGRESSION TEST: GEval Score None Handling
+# This configuration tests that GEval metrics properly warn when score is None
+# before defaulting to 0.0.
 #
 # HOW TO REPRODUCE:
 # 1. Run: lightspeed-evaluation --system-config config/system.yaml --eval-config bug_test_geval_score_mismatch.yaml
-# 2. Observe the output - if GEval returns None score, it will silently become 0.0
-# 3. Check logs - NO WARNING will be shown about the score mismatch
+# 2. If GEval returns None score, it will become 0.0
+# 3. Check logs - A WARNING should be shown about the None score
 #
-# EXPECTED BEHAVIOR: A warning should be logged when score is None
-# ACTUAL BEHAVIOR: Silent conversion from None -> 0.0
+# EXPECTED BEHAVIOR: Warning logged when score is None, then defaults to 0.0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bug_test_geval_score_mismatch.yaml` around lines 1 - 11: update the top-of-file comments to reflect that the fix is merged and this YAML now acts as a regression test: change "ACTUAL BEHAVIOR: Silent conversion from None -> 0.0" and "NO WARNING will be shown" to state that warnings are now logged when GEval returns None scores, rename the header to indicate "REGRESSION TEST" instead of "BUG REPRODUCTION", and adjust the HOW TO REPRODUCE/EXPECTED BEHAVIOR sections to instruct running lightspeed-evaluation and verifying that a warning is emitted when GEval returns None (so the file documents the fixed behavior and test purpose).

tests/unit/core/metrics/test_geval.py (1)
Lines 360-398: Consider adding tests for retry behavior and warning logs. The existing test_evaluate_turn_none_score_returns_zero verifies the None → 0.0 conversion, but there's no explicit verification that:
- The retry logic actually retries on transient failures
- A warning is logged when score is None
These would strengthen confidence in the new retry/warning functionality.
Would you like me to help generate additional test cases for retry behavior and warning log verification?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/core/metrics/test_geval.py` around lines 360 - 398, Add two assertions/tests alongside test_evaluate_turn_none_score_returns_zero to (1) simulate transient failures and verify retry behavior by having the patched GEval (mocker.patch("lightspeed_evaluation.core.metrics.geval.GEval")) raise a transient exception on the first call and return a valid metric on a subsequent call, then call GEvalHandler.evaluate and assert the final score/reason come from the successful retry; and (2) verify a warning is logged when the metric score is None by patching or capturing the handler's logger (or using caplog) around GEvalHandler.evaluate and asserting logger.warning was called with the expected message when mock_metric.score is None. Ensure you reference GEval, GEvalHandler.evaluate, mock_metric_manager.get_metric_metadata, and the existing test fixture names (turn_data, conv_data) when locating where to add these assertions.
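The warning-path check the reviewer asks for can be sketched independently of the real handler. The helper below mirrors the None-to-0.0 logic from the PR; the function and logger names are illustrative, not the project's actual code.

```python
import logging

logger = logging.getLogger("geval_sketch")


def resolve_score(score, reason):
    """Mirror the PR's None handling: log a warning, then default to 0.0."""
    if score is None:
        logger.warning(
            "GEval metric returned None score; defaulting to 0.0. Reason: %s",
            reason,
        )
        return 0.0
    return score
```

In a pytest test, wrapping the call in `caplog.at_level(logging.WARNING)` would let you assert on the emitted record instead of attaching a handler manually.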
The reason for the above changes: previously, if the LLM judge failed to produce a score, it was silently set to 0 without any warning.
VladimirKadlec
left a comment
Thanks 👍
Please remove the bug_test_geval_score_mismatch.yaml file from the PR. Some test would be nice if you are able to mock the error.
@x86girl Thank you!!