
feat: add metrics CLI filter, skip field, retry improvements, and infer endpoint #222

Open

Lifto wants to merge 6 commits into lightspeed-core:main from Lifto:upstream-pr

Conversation

@Lifto Lifto commented Apr 23, 2026

Summary

  • --metrics CLI flag: Filter which metrics run during evaluation (e.g. --metrics custom:answer_correctness to skip expensive metrics)
  • skip field for conversations: Add skip: true and skip_reason to eval YAML to disable conversations without deleting them
  • Retry all server errors: Broaden retry from HTTP 429 only to include 5xx server errors with exponential backoff
  • /infer endpoint support: Add RLSAPI /v1/infer endpoint type with metadata extraction (tool_calls, rag_chunks, token counts)
  • Default retries increased to 5: More resilience for long evaluation runs against transient failures
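The CLI wiring for the new flag might look roughly like this (only the `--metrics` flag name comes from this PR; the parser structure and the `nargs` choice are assumptions for illustration):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser sketch; only --metrics is described in this PR."""
    parser = argparse.ArgumentParser(prog="lightspeed-eval")
    parser.add_argument(
        "--metrics",
        nargs="+",
        default=None,  # None means "run everything configured"
        help="Restrict evaluation to these metric IDs, e.g. custom:answer_correctness",
    )
    return parser

args = build_parser().parse_args(["--metrics", "custom:answer_correctness"])
print(args.metrics)  # ['custom:answer_correctness']
```

With `default=None` the loader can distinguish "no filter requested" from an explicit metric list.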

Testing

  • 75 API tests passing
  • All pre-commit checks pass (pyright 0 errors on changed files; 2 pre-existing errors in upstream's test_validator.py unrelated to this PR)
  • /infer endpoint tested live against local RHEL Lightspeed stack

Notes

These features were developed on a fork (emac-E/lightspeed-evaluation) and cherry-picked/ported to upstream's current main. The retry and infer changes were manually ported due to structural divergence in client.py.

Summary by CodeRabbit

  • New Features

    • Support for a new "infer" API endpoint
    • Conversation-level skip flag with optional skip reason
    • New --metrics CLI option to filter evaluation data by metric IDs
  • Improvements

    • Retry behavior broadened to cover HTTP 429 and transient 5xx server errors
    • Improved retry failure messaging to reference retryable server errors (429/5xx)

Lifto added 5 commits April 22, 2026 15:53
Allows running a subset of configured metrics without editing YAML configs.
Example: --metrics custom:answer_correctness to skip RAGAS metrics.

Adds skip and skip_reason fields to EvaluationData. Conversations
with skip: true are silently excluded during loading. skip_reason
is documentation-only — it stays in the YAML for humans to read.
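The skip behavior can be sketched with a simplified model (the `Conversation` dataclass below is a stand-in for illustration, not the real `EvaluationData`):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Conversation:  # simplified stand-in for EvaluationData
    conversation_id: str
    skip: bool = False
    skip_reason: Optional[str] = None  # documentation-only, never read by code

def drop_skipped(conversations: list[Conversation]) -> list[Conversation]:
    """Silently exclude conversations marked skip: true during loading."""
    return [c for c in conversations if not c.skip]

data = [
    Conversation("conv-1"),
    Conversation("conv-2", skip=True, skip_reason="flaky backend"),
]
print([c.conversation_id for c in drop_skipped(data)])  # ['conv-1']
```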

Broaden retry logic from HTTP 429 only to include 5xx server errors,
enabling automatic retry with exponential backoff for transient
server failures.
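A minimal self-contained sketch of the broadened predicate and backoff (the real client layers this through tenacity and httpx exceptions; the status check and constants here are illustrative):

```python
import random

def is_retryable_status(status: int) -> bool:
    """429 plus any 5xx, matching the broadened retry described above."""
    return status == 429 or 500 <= status < 600

def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter; base and cap are illustrative."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Note that later in this review thread the retryable set is narrowed to exactly (429, 502, 503, 504).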

Add RLSAPI /v1/infer endpoint support for tool call and RAG chunk
metadata retrieval, used by RHEL Lightspeed backend testing.

Provides more resilience against transient server failures,
especially during long evaluation runs.
coderabbitai Bot commented Apr 23, 2026

Walkthrough

Adds an "infer" endpoint to the API client, broadens retry logic to include HTTP 429 and select 5xx errors, introduces conversation-level skip fields, and adds metric-based filtering in evaluation data loading plus CLI support and tests.

Changes

  • API Retry & Infer Endpoint — src/lightspeed_evaluation/core/api/client.py, src/lightspeed_evaluation/core/constants.py, src/lightspeed_evaluation/core/models/system.py
    Replaces 429-only retry predicate with _is_retryable_server_error (429 + selected 5xx), updates tenacity retry usage and retry-exhaust message, adds endpoint_type == "infer" handling and _rlsapi_infer_query that posts to /api/lightspeed/{version}/infer, transforms request (question + metadata), maps response to APIResponse, pairs tool_call ids with tool_results, derives RAG chunks from tool_results content, and adds "infer" to supported endpoint types.
  • Data Model & Validation — src/lightspeed_evaluation/core/models/data.py, src/lightspeed_evaluation/core/system/validator.py
    Adds skip: bool and skip_reason: Optional[str] to EvaluationData; DataValidator.load_evaluation_data(...) signature gains metrics: Optional[list[str]] and now filters out skipped conversations and restricts/derives turn_metrics and conversation_metrics to the provided metrics set using defaults when available.
  • CLI Updates — src/lightspeed_evaluation/runner/evaluation.py
    Adds --metrics CLI argument and forwards it to DataValidator.load_evaluation_data(...).
  • Test Fixtures — tests/unit/core/api/conftest.py
    Adds basic_api_config_infer_endpoint pytest fixture for the infer endpoint.
  • Test Coverage — tests/unit/core/api/test_client.py, tests/unit/core/system/test_validator.py, tests/unit/runner/test_evaluation.py
    Extends tests to cover _is_retryable_server_error behavior (429/5xx retryable, other 4xx/non-HTTP not), retry success scenarios, new TestInferEndpoint validating request body/URL, response parsing (tokens, tool_calls/results, RAG chunks), infer error conditions/timeouts, skip-flag filtering, metric-based defaulting/filtering, and updates to evaluation arg tests to include metrics.
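The infer-response mapping summarized above (pairing tool_call ids with tool_results and deriving RAG chunks) can be sketched as follows. All field names here (tool_calls, tool_results, id, call_id, content, status, type) are assumptions about the /infer payload shape, not the confirmed RLSAPI contract:

```python
def parse_infer_response(data: dict) -> dict:
    """Sketch: pair tool calls with their results and accumulate RAG chunks."""
    results_by_id = {r.get("call_id"): r for r in data.get("tool_results", [])}

    tool_calls = []
    for call in data.get("tool_calls", []):
        matched = results_by_id.get(call.get("id"), {})
        tool_calls.append(
            {
                "name": call.get("name", ""),
                "args": call.get("args", {}),
                # keep full content as the result, with status preserved separately
                "result": matched.get("content"),
                "status": matched.get("status"),
            }
        )

    rag_chunks = []
    for result in data.get("tool_results", []):
        if result.get("type") == "mcp_call":
            # extend (not assign) so chunks from every mcp_call accumulate
            rag_chunks.extend(
                {"content": chunk}
                for chunk in str(result.get("content", "")).split("\n")
                if chunk.strip()
            )

    return {"tool_calls": tool_calls, "rag_chunks": rag_chunks}
```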

Sequence Diagram(s)

```mermaid
sequenceDiagram
    rect rgba(0,128,255,0.5)
    actor Client
    end
    rect rgba(0,200,0,0.5)
    participant APIClient as APIClient\nquery()
    participant Transport as HTTP Transport
    participant InferAPI as /infer\nEndpoint
    participant Parser as Response\nParser
    end

    Client->>APIClient: query(endpoint_type="infer", input)
    APIClient->>APIClient: _rlsapi_infer_query()\nbuild request (question + metadata)
    APIClient->>Transport: POST /api/lightspeed/{version}/infer
    Transport->>InferAPI: forward request
    InferAPI-->>Transport: response JSON
    Transport-->>APIClient: response object
    APIClient->>Parser: parse response.data
    Parser->>Parser: map to APIResponse\nmatch tool_calls -> tool_results\nextract RAG chunks
    Parser-->>APIClient: APIResponse
    APIClient-->>Client: return APIResponse
```

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title comprehensively covers all major changes: metrics CLI filter, skip field functionality, retry improvements, and the new infer endpoint. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 97.50%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues Check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes Check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/core/models/system.py (1)

272-275: ⚠️ Potential issue | 🟡 Minor

Stale field description — still says "streaming or query".

Now that "infer" is a supported endpoint type, update the endpoint_type Field description to include it (e.g. "API endpoint type (streaming, query, or infer)").

Proposed fix

```diff
     endpoint_type: str = Field(
         default=DEFAULT_ENDPOINT_TYPE,
-        description="API endpoint type (streaming or query)",
+        description="API endpoint type (streaming, query, or infer)",
     )
```
tests/unit/core/system/test_validator.py (1)

1-18: ⚠️ Potential issue | 🔴 Critical

Add missing MockerFixture import — module will fail at import time.

The test methods on lines 567 and 592 use mocker: MockerFixture type annotations, but MockerFixture is never imported. Without from __future__ import annotations, Python evaluates function annotations at definition time, causing NameError: name 'MockerFixture' is not defined during pytest collection.

Proposed fix

```diff
 import pytest

 from pydantic import ValidationError
+from pytest_mock import MockerFixture

 from lightspeed_evaluation.core.models import EvaluationData, SystemConfig, TurnData
```
🧹 Nitpick comments (2)
tests/unit/core/system/test_validator.py (1)

567-606: Tests misplaced: belong in TestDataValidator, not TestFilterByScope.

These two tests exercise DataValidator.load_evaluation_data (skip-filtering behavior), not _filter_by_scope. Placing them under TestFilterByScope is misleading and breaks the file's “mirror-behavior-under-test” structure. Consider moving them into TestDataValidator (which already contains test_load_evaluation_data_* tests) or into a dedicated TestLoadEvaluationDataSkip class.

src/lightspeed_evaluation/core/api/client.py (1)

30-42: Verify 5xx retries are safe for POST query endpoints.

The new predicate retries all 5xx responses from POST-based query paths. If the server processes a turn/tool call and then returns 5xx, this can replay side effects. Please confirm these endpoints are idempotent, or add a stable idempotency/request key before retrying 5xx responses.

Also applies to: 71-80, 297-300, 324-327, 436-439
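One way to make 5xx retries safe for non-idempotent POSTs, as the comment suggests, is to attach a stable key per logical request so the server can deduplicate replays. This is a hypothetical sketch: the Idempotency-Key header name is a common convention, and backend support for it is an assumption, not something this PR provides:

```python
import uuid
from typing import Optional

def with_idempotency_key(headers: Optional[dict] = None) -> dict:
    """Attach a stable per-request key so a replayed POST can be
    deduplicated server-side (only effective if the backend honors it)."""
    headers = dict(headers or {})
    headers.setdefault("Idempotency-Key", str(uuid.uuid4()))
    return headers
```

The same key must be reused across retries of one logical request, so it should be generated once before the retry loop, not inside it.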

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/core/api/client.py` around lines 30 - 42, The
_is_retryable_server_error predicate currently retries all 5xx responses
unconditionally which can replay side-effects for POST-based query endpoints;
update _is_retryable_server_error (and the same logic at the other occurrences)
to either (a) only treat 5xx as retryable for safe idempotent HTTP methods
(check exception.response.request.method and allow retries for
GET/HEAD/OPTIONS/PUT when appropriate), or (b) require and verify a stable
idempotency/request key header (e.g., "Idempotency-Key") on the original request
before returning True for 5xx so POSTs are only retried when an idempotency key
is present. Ensure you reference exception.response.request.method and
exception.response.request.headers when implementing this change so POSTs
without an idempotency key are not retried.
📥 Commits

Reviewing files that changed from the base of the PR and between 613addc and ce00854.

📒 Files selected for processing (9)
  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/core/api/conftest.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/system/test_validator.py

Comment on lines +354 to +358

```python
request_data = api_request.model_dump(exclude_none=True)
infer_request: dict[str, object] = {
    "question": request_data.pop("query"),
    "include_metadata": True,
}
```

@coderabbitai coderabbitai Bot Apr 23, 2026


⚠️ Potential issue | 🟠 Major

Don’t drop prepared request parameters for /infer.

Line 356 sends only question and include_metadata, so extra_request_params and other prepared request fields are ignored for endpoint_type == "infer". If /infer supports only a subset, filter explicitly instead of silently discarding everything.

Proposed fix

```diff
-            request_data = api_request.model_dump(exclude_none=True)
+            request_data = self._serialize_request(api_request)
+            question = request_data.pop("query")
             infer_request: dict[str, object] = {
-                "question": request_data.pop("query"),
+                **request_data,
+                "question": question,
                 "include_metadata": True,
             }
```

Lifto (Author) replied:

Fixed — added a code comment explaining that extra_request_params are not forwarded to /infer since the endpoint only accepts question and include_metadata. Other params (model, provider, etc.) are not part of the RLSAPI /infer API contract.

Comment on lines +59 to +62

```python
SUPPORTED_ENDPOINT_TYPES = ["streaming", "query", "infer"]
DEFAULT_API_CACHE_DIR = ".caches/api_cache"

DEFAULT_API_NUM_RETRIES = 3
```

@coderabbitai coderabbitai Bot Apr 23, 2026


⚠️ Potential issue | 🟡 Minor

Inconsistent retry default: API retries still 3 while LLM retries bumped to 5.

The PR description states default retry attempts were increased from 3 to 5, but DEFAULT_API_NUM_RETRIES (line 62) is still 3 while only DEFAULT_LLM_RETRIES (line 73) was updated. Note that the API client's retry decorator uses APIConfig.num_retries (which defaults to DEFAULT_API_NUM_RETRIES), so the advertised retry bump does not actually reach the new 429/5xx retry path for API calls. Confirm whether the API default should also be raised to 5 for consistency.


Lifto (Author) replied:

Addressed — reverted DEFAULT_LLM_RETRIES back to 3. As asamal4 noted, this is the judge LLM default, not API retry. Actual values are configurable per-config.



@asamal4 asamal4 left a comment


Thank you!! I have added a few comments, plus there are some valid CodeRabbit comments. PTAL.

```diff
 DEFAULT_LLM_TEMPERATURE = 0.0
 DEFAULT_LLM_MAX_TOKENS = 512
-DEFAULT_LLM_RETRIES = 3
+DEFAULT_LLM_RETRIES = 5
```
asamal4 (Collaborator) commented:

This is for the judge LLM, not for API calls. Also, this is just the default value; the actual value can be set in config as needed. Do we really need to change this?

Lifto (Author) replied:

Agreed, reverted to 3. Will configure via eval config when needed.

```python
if not isinstance(exception, httpx.HTTPStatusError):
    return False
status = exception.response.status_code
return status == 429 or 500 <= status < 600
```
asamal4 (Collaborator) commented:

Instead of a range, could you please list the exact status codes, like 502 and 503?

I understand the purpose, but at the same time I am concerned that we will end up wasting retries. In the future we will probably make this configurable.

Lifto (Author) replied:

Done — narrowed to status in (429, 502, 503, 504). Only clearly transient errors: 429=rate limit, 502=bad gateway, 503=unavailable, 504=gateway timeout. 500 excluded since it can indicate permanent server bugs.

Comment on lines +406 to +408

```python
tool_call.get("tool_name")
or tool_call.get("name")
or ""
```
asamal4 (Collaborator) commented:

tool_name and name were both checked for compatibility with older LCORE versions. But is the same true for RLSAPI? Can we use the actual property name?

Lifto (Author) replied:

Good point — RLSAPI uses its own native fields (name/args), not the legacy OLS tool_name/arguments format. Removed the fallback chain in _rlsapi_infer_query and now use tool_call.get("name", "") and tool_call.get("args", {}) directly. The legacy fallback in _standard_query is untouched since OLS may still use either format.

asamal4 (Collaborator) commented:

Looking at the whole flow, I think that this (overall API interaction/data processing) can be modularized further. Non-blocker for this PR.

```python
# Remove skipped conversations
evaluation_data = [e for e in evaluation_data if not e.skip]

# Filter turn_metrics if --metrics was specified
```
asamal4 (Collaborator) commented:

Based on this logic, I see two limitations:

  1. --metrics only allows turn-level metric selection, not conversation-level.
  2. This works only if there is a metric override in eval_data.yaml. But we can also set default metrics in system.yaml/config without overriding from eval_data.

Lifto (Author) replied:

Both issues addressed:

  1. turn_metrics=None handling: Fixed the guard from if turn.turn_metrics: to if turn.turn_metrics is not None:. When turn_metrics is None (use system defaults) and --metrics is specified, we now materialize the system defaults (entries with default: true) and then filter — so downstream MetricManager sees a non-None filtered list.

  2. conversation_metrics filtering: Added. Same three-way handling: None → materialize defaults then filter, [] → leave as-is (skip all), [...] → filter the explicit list.

Added 3 tests covering all cases.
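The three-way handling described above can be sketched as follows (function name and signature are illustrative, not the PR's actual code):

```python
from typing import Optional

def resolve_and_filter_metrics(
    configured: Optional[list],   # per-conversation metrics from eval YAML
    requested: Optional[list],    # value of the --metrics CLI flag
    system_defaults: list,        # metric IDs marked default: true in system config
) -> Optional[list]:
    """Sketch of the None / [] / [...] handling described in the reply."""
    if requested is None:
        return configured                    # no CLI filter: leave everything untouched
    if configured is None:
        configured = list(system_defaults)   # materialize defaults before filtering
    elif not configured:
        return configured                    # explicit [] means "skip all": leave as-is
    wanted = set(requested)
    return [m for m in configured if m in wanted]
```

The key subtlety is distinguishing `None` (use system defaults) from `[]` (explicitly run nothing) before applying the filter.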

asamal4 (Collaborator) commented:

Sorry I didn't mention that metric resolution is already handled.

Here we are adding (1) duplicate logic to process turn- and conversation-level metrics, and (2) metric resolution from config/system. But this is already managed in MetricManager, where the final turn- and conversation-level metrics are resolved. Moving the CLI filter there would eliminate significant duplicate code and be more consistent.

This can be refactored later (follow-up PR). I am considering this a non-blocker.

- Revert DEFAULT_LLM_RETRIES from 5 to 3
- Narrow retry codes to (429, 502, 503, 504), exclude 500
- Use RLSAPI native fields (name/args) in _rlsapi_infer_query
- Fix RAG chunk accumulation across multiple mcp_call results
- Redact prompt from debug log, log only metadata
- Add comment about extra_request_params not forwarded to /infer
- Fix tool result capture: use content with status fallback
- Update endpoint_type description to include infer
- Move skip tests from TestFilterByScope to TestDataValidator
- Fix MockerFixture import in test_validator.py
- Fix --metrics filter: handle turn_metrics=None by materializing
  system defaults before filtering; add conversation_metrics filter
- Add metrics=None to runner test fixture for --metrics support
- Add tests for metrics filter materialization

Signed-off-by: Ellis Low <elow@redhat.com>

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
tests/unit/runner/test_evaluation.py (1)

356-361: Consider adding coverage for a non-None metrics value.

Both updated assertions only verify that metrics=None is forwarded. Since this PR introduces the --metrics filter as a new feature, a small parametrized case (or extra test) that exercises a non-None list (e.g. metrics=["custom:answer_correctness"]) would lock in the wiring from CLI → eval_args.metrics → DataValidator.load_evaluation_data(...). The test_main_with_filters parametrize at lines 875-885 would be a natural place to extend.

Also applies to: 412-417

tests/unit/core/api/test_client.py (1)

692-724: Add 504 case to retryable predicate test.

_is_retryable_server_error retries (429, 502, 503, 504) but this test only exercises 502/503 from the 5xx set. Adding a 504 assertion closes the obvious coverage gap and guards against regressions if someone trims the tuple.

♻️ Proposed addition

```diff
         resp_503 = mocker.Mock(status_code=503)
         assert _is_retryable_server_error(
             httpx.HTTPStatusError("", request=mocker.Mock(), response=resp_503)
         )

+        resp_504 = mocker.Mock(status_code=504)
+        assert _is_retryable_server_error(
+            httpx.HTTPStatusError("", request=mocker.Mock(), response=resp_504)
+        )
+
         resp_400 = mocker.Mock(status_code=400)
```

📥 Commits

Reviewing files that changed from the base of the PR and between ce00854 and be8019d.

📒 Files selected for processing (7)
  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/system/validator.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/runner/test_evaluation.py
✅ Files skipped from review due to trivial changes (3)
  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/api/client.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lightspeed_evaluation/core/system/validator.py
  • tests/unit/core/system/test_validator.py


@asamal4 asamal4 left a comment


LGTM!! Please fix the linting issue.

Non-blocking issues:

  1. Metrics filter: duplicate logic for metric resolution.
  2. API client: we now have three separate processing paths; this can also be refactored further.
