feat: add metrics CLI filter, skip field, retry improvements, and infer endpoint #222
Lifto wants to merge 6 commits into lightspeed-core:main
Conversation
Allows running a subset of configured metrics without editing YAML configs. Example: --metrics custom:answer_correctness to skip RAGAS metrics.
Adds skip and skip_reason fields to EvaluationData. Conversations with skip: true are silently excluded during loading. skip_reason is documentation-only — it stays in the YAML for humans to read.
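A minimal sketch of how this looks in practice; the conversation layout below is illustrative, and only skip and skip_reason are the actual new fields:

```python
import yaml  # PyYAML

# Illustrative eval data; only `skip` and `skip_reason` are new in this PR,
# the rest of the structure is a placeholder, not the exact schema.
EVAL_YAML = """
- conversation_group_id: conv_flaky_backend
  skip: true
  skip_reason: "Backend fixture is down; re-enable once restored"
  turns: []
- conversation_group_id: conv_basic
  turns: []
"""

conversations = yaml.safe_load(EVAL_YAML)
# Mirrors the loader behavior: conversations with skip: true are silently
# excluded; skip_reason is never read by the code, it is for humans only.
active = [c for c in conversations if not c.get("skip", False)]
print([c["conversation_group_id"] for c in active])  # ['conv_basic']
```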
Broaden retry logic from HTTP 429 only to include 5xx server errors, enabling automatic retry with exponential backoff for transient server failures. Add RLSAPI /v1/infer endpoint support for tool call and RAG chunk metadata retrieval, used by RHEL Lightspeed backend testing.
Provides more resilience against transient server failures, especially during long evaluation runs.
Walkthrough
Adds an "infer" endpoint to the API client, broadens retry logic to include HTTP 429 and select 5xx errors, and introduces conversation-level skip support.

Changes
Sequence Diagram(s)

sequenceDiagram
    rect rgba(0,128,255,0.5)
        actor Client
    end
    rect rgba(0,200,0,0.5)
        participant APIClient as APIClient\nquery()
        participant Transport as HTTP Transport
        participant InferAPI as /infer\nEndpoint
        participant Parser as Response\nParser
    end
    Client->>APIClient: query(endpoint_type="infer", input)
    APIClient->>APIClient: _rlsapi_infer_query()\nbuild request (question + metadata)
    APIClient->>Transport: POST /api/lightspeed/{version}/infer
    Transport->>InferAPI: forward request
    InferAPI-->>Transport: response JSON
    Transport-->>APIClient: response object
    APIClient->>Parser: parse response.data
    Parser->>Parser: map to APIResponse\nmatch tool_calls -> tool_results\nextract RAG chunks
    Parser-->>APIClient: APIResponse
    APIClient-->>Client: return APIResponse
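Based on the diagram and the review threads below (native name/args tool-call fields, accumulating RAG chunks with extend, preferring tool result content over status), the parsing step might look roughly like this. The response field names, the id-based call/result matching, and the chunk delimiter are assumptions; this is not the repo's actual _rlsapi_infer_query code.

```python
from typing import Any


def parse_infer_response(data: dict[str, Any]) -> dict[str, Any]:
    """Sketch of mapping an /infer response into evaluation fields."""
    parsed: dict[str, Any] = {
        "response": data.get("output", ""),  # assumed field name
        "tool_calls": [],
        "rag_chunks": [],
    }

    # Assumption: tool results can be matched to their calls by a shared "id".
    results_by_id = {r.get("id"): r for r in data.get("tool_results", [])}

    for call in data.get("tool_calls", []):
        formatted = {
            # RLSAPI native fields per the review thread (no legacy
            # tool_name/arguments fallback here).
            "name": call.get("name", ""),
            "arguments": call.get("args", {}),
        }
        match = results_by_id.get(call.get("id"))
        if match is not None:
            # Keep the actual output; fall back to status only if content is missing.
            formatted["result"] = match.get("content") or match.get("status")
        parsed["tool_calls"].append(formatted)

    for result in data.get("tool_results", []):
        if result.get("type") == "mcp_call":
            # Delimiter is a guess; the point is extend (accumulate), not assign.
            chunks = [c.strip() for c in str(result.get("content", "")).split("\n\n") if c.strip()]
            parsed["rag_chunks"].extend({"content": c} for c in chunks)

    return parsed
```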
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/core/models/system.py (1)
272-275: ⚠️ Potential issue | 🟡 Minor
Stale field description — still says "streaming or query".
Now that "infer" is a supported endpoint type, update the endpoint_type Field description to include it (e.g. "API endpoint type (streaming, query, or infer)").
Proposed fix

 endpoint_type: str = Field(
     default=DEFAULT_ENDPOINT_TYPE,
-    description="API endpoint type (streaming or query)",
+    description="API endpoint type (streaming, query, or infer)",
 )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/core/models/system.py` around lines 272 - 275: The Field description for endpoint_type (the endpoint_type attribute using Field with default DEFAULT_ENDPOINT_TYPE) is outdated; update its description string to list the new supported value by changing "API endpoint type (streaming or query)" to include "infer" (e.g., "API endpoint type (streaming, query, or infer)") so the Field accurately documents supported endpoint types.

tests/unit/core/system/test_validator.py (1)
1-18: ⚠️ Potential issue | 🔴 Critical
Add missing MockerFixture import — module will fail at import time.
The test methods on lines 567 and 592 use mocker: MockerFixture type annotations, but MockerFixture is never imported. Without from __future__ import annotations, Python evaluates function annotations at definition time, causing NameError: name 'MockerFixture' is not defined during pytest collection.
Proposed fix

 import pytest
 from pydantic import ValidationError
+from pytest_mock import MockerFixture

 from lightspeed_evaluation.core.models import EvaluationData, SystemConfig, TurnData

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/core/system/test_validator.py` around lines 1 - 18, Add the missing type import for MockerFixture so annotations like "mocker: MockerFixture" in the test functions that use the mocker fixture resolve at import time; specifically, add "from pytest_mock import MockerFixture" alongside the other imports at the top of the test file so tests referencing MockerFixture (the functions annotated with mocker: MockerFixture) no longer raise NameError during collection.
🧹 Nitpick comments (2)
tests/unit/core/system/test_validator.py (1)
567-606: Tests misplaced: belong in TestDataValidator, not TestFilterByScope.
These two tests exercise DataValidator.load_evaluation_data (skip-filtering behavior), not _filter_by_scope. Placing them under TestFilterByScope is misleading and breaks the file's "mirror-behavior-under-test" structure. Consider moving them into TestDataValidator (which already contains test_load_evaluation_data_* tests) or into a dedicated TestLoadEvaluationDataSkip class.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/core/system/test_validator.py` around lines 567 - 606: The two tests test_skip_removes_conversation and test_skip_false_keeps_conversation belong with tests for DataValidator.load_evaluation_data, not scope filtering; move their definitions out of the TestFilterByScope class and place them inside the TestDataValidator class (or create a new TestLoadEvaluationDataSkip class) so they mirror the unit-under-test (DataValidator.load_evaluation_data) and keep file structure consistent; ensure any class-level fixtures/imports used by TestFilterByScope are available in the new location and update the test class name if you create a new dedicated class.

src/lightspeed_evaluation/core/api/client.py (1)
30-42: Verify 5xx retries are safe for POST query endpoints.
The new predicate retries all 5xx responses from POST-based query paths. If the server processes a turn/tool call and then returns 5xx, this can replay side effects. Please confirm these endpoints are idempotent, or add a stable idempotency/request key before retrying 5xx responses (a sketch follows the agent prompt below).
Also applies to: 71-80, 297-300, 324-327, 436-439
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/core/api/client.py` around lines 30 - 42, The _is_retryable_server_error predicate currently retries all 5xx responses unconditionally which can replay side-effects for POST-based query endpoints; update _is_retryable_server_error (and the same logic at the other occurrences) to either (a) only treat 5xx as retryable for safe idempotent HTTP methods (check exception.response.request.method and allow retries for GET/HEAD/OPTIONS/PUT when appropriate), or (b) require and verify a stable idempotency/request key header (e.g., "Idempotency-Key") on the original request before returning True for 5xx so POSTs are only retried when an idempotency key is present. Ensure you reference exception.response.request.method and exception.response.request.headers when implementing this change so POSTs without an idempotency key are not retried.
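If either option were adopted, the predicate could take roughly the following shape; only the helper name _is_retryable_server_error comes from the diff, while the status set and header handling here are illustrative:

```python
import httpx

RETRYABLE_STATUSES = {429, 502, 503, 504}
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def _is_retryable_server_error(exception: BaseException) -> bool:
    """Retry 429 always; retry 5xx only when the request is safe to replay."""
    if not isinstance(exception, httpx.HTTPStatusError):
        return False
    status = exception.response.status_code
    if status == 429:
        return True
    if status not in RETRYABLE_STATUSES:
        return False
    request = exception.response.request
    # Replay-safe if the method is idempotent or the caller supplied an
    # Idempotency-Key header (option (b) from the comment above).
    return request.method in SAFE_METHODS or "Idempotency-Key" in request.headers
```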
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_evaluation/core/api/client.py`:
- Around line 416-428: The code currently assigns only matching_result["status"]
to formatted_tool["result"], discarding the actual tool output; update the block
that finds matching_result (using response_data, tool_call, formatted_tool,
formatted_tool_calls) to attach the full payload: set formatted_tool["result"] =
matching_result.get("content") (or matching_result["content"] if guaranteed) and
also preserve status separately e.g. formatted_tool["status"] =
matching_result.get("status"); leave
formatted_tool_calls.append([formatted_tool]) as-is.
- Around line 360-364: The debug log currently prints the entire infer_request
(logger.debug("RLSAPI infer request body: %s", infer_request)), exposing the
user prompt; update the logging so it only emits non-sensitive metadata: log
self.config.version, infer_request.get("include_metadata") and the prompt length
(e.g., len(infer_request.get("prompt",""))), and remove or redact the raw prompt
text from the logger.debug call; ensure you keep the existing logger.debug for
the URL (using self.config.version) and replace the full-body log with the
sanitized metadata fields.
- Around line 386-393: The code currently overwrites response_data["rag_chunks"]
for each mcp_call so only the last tool result is kept; fix it by initializing
response_data["rag_chunks"] = [] if missing, then for each result where
result.get("type") == "mcp_call" split result["content"] and extend (not assign)
response_data["rag_chunks"] with {"content": chunk} entries (filtering out
empty/whitespace chunks) so all MCP tool results accumulate rather than replace
previous ones.
- Around line 354-358: The current logic in the block that builds infer_request
(using api_request.model_dump -> request_data and then setting infer_request =
{"question": request_data.pop("query"), "include_metadata": True}) silently
drops prepared fields like extra_request_params when endpoint_type == "infer";
update the construction in the function that handles endpoint_type == "infer" to
explicitly preserve or merge only allowed keys from request_data (e.g., merge
request_data.get("extra_request_params", {}) or filter request_data by an
allowed list) into infer_request instead of discarding request_data, referencing
the variables request_data, infer_request and the api_request.model_dump call to
locate the code.
In `@src/lightspeed_evaluation/core/constants.py`:
- Around line 59-62: DEFAULT_API_NUM_RETRIES is still set to 3 while
DEFAULT_LLM_RETRIES was bumped to 5, so API calls using APIConfig.num_retries
are not getting the intended increased retry behavior; update
DEFAULT_API_NUM_RETRIES to 5 to match DEFAULT_LLM_RETRIES (or explicitly
document/choose a different policy) so the API client's retry decorator uses the
new default number of retries.
---
Outside diff comments:
In `@src/lightspeed_evaluation/core/models/system.py`:
- Around line 272-275: The Field description for endpoint_type (the
endpoint_type attribute using Field with default DEFAULT_ENDPOINT_TYPE) is
outdated; update its description string to list the new supported value by
changing "API endpoint type (streaming or query)" to include "infer" (e.g., "API
endpoint type (streaming, query, or infer)") so the Field accurately documents
supported endpoint types.
In `@tests/unit/core/system/test_validator.py`:
- Around line 1-18: Add the missing type import for MockerFixture so annotations
like "mocker: MockerFixture" in the test functions that use the mocker fixture
resolve at import time; specifically, add "from pytest_mock import
MockerFixture" alongside the other imports at the top of the test file so tests
referencing MockerFixture (the functions annotated with mocker: MockerFixture)
no longer raise NameError during collection.
---
Nitpick comments:
In `@src/lightspeed_evaluation/core/api/client.py`:
- Around line 30-42: The _is_retryable_server_error predicate currently retries
all 5xx responses unconditionally which can replay side-effects for POST-based
query endpoints; update _is_retryable_server_error (and the same logic at the
other occurrences) to either (a) only treat 5xx as retryable for safe idempotent
HTTP methods (check exception.response.request.method and allow retries for
GET/HEAD/OPTIONS/PUT when appropriate), or (b) require and verify a stable
idempotency/request key header (e.g., "Idempotency-Key") on the original request
before returning True for 5xx so POSTs are only retried when an idempotency key
is present. Ensure you reference exception.response.request.method and
exception.response.request.headers when implementing this change so POSTs
without an idempotency key are not retried.
In `@tests/unit/core/system/test_validator.py`:
- Around line 567-606: The two tests test_skip_removes_conversation and
test_skip_false_keeps_conversation belong with tests for
DataValidator.load_evaluation_data, not scope filtering; move their definitions
out of the TestFilterByScope class and place them inside the TestDataValidator
class (or create a new TestLoadEvaluationDataSkip class) so they mirror the
unit-under-test (DataValidator.load_evaluation_data) and keep file structure
consistent; ensure any class-level fixtures/imports used by TestFilterByScope
are available in the new location and update the test class name if you create a
new dedicated class.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 976ba54f-6f49-49c9-a115-1035789edb3c
📒 Files selected for processing (9)
- src/lightspeed_evaluation/core/api/client.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/system/validator.py
- src/lightspeed_evaluation/runner/evaluation.py
- tests/unit/core/api/conftest.py
- tests/unit/core/api/test_client.py
- tests/unit/core/system/test_validator.py
request_data = api_request.model_dump(exclude_none=True)
infer_request: dict[str, object] = {
    "question": request_data.pop("query"),
    "include_metadata": True,
}
Don’t drop prepared request parameters for /infer.
Line 356 sends only question and include_metadata, so extra_request_params and other prepared request fields are ignored for endpoint_type == "infer". If /infer supports only a subset, filter explicitly instead of silently discarding everything.
Proposed fix
- request_data = api_request.model_dump(exclude_none=True)
+ request_data = self._serialize_request(api_request)
+ question = request_data.pop("query")
infer_request: dict[str, object] = {
- "question": request_data.pop("query"),
+ **request_data,
+ "question": question,
"include_metadata": True,
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_evaluation/core/api/client.py` around lines 354 - 358, The
current logic in the block that builds infer_request (using
api_request.model_dump -> request_data and then setting infer_request =
{"question": request_data.pop("query"), "include_metadata": True}) silently
drops prepared fields like extra_request_params when endpoint_type == "infer";
update the construction in the function that handles endpoint_type == "infer" to
explicitly preserve or merge only allowed keys from request_data (e.g., merge
request_data.get("extra_request_params", {}) or filter request_data by an
allowed list) into infer_request instead of discarding request_data, referencing
the variables request_data, infer_request and the api_request.model_dump call to
locate the code.
Fixed — added a code comment explaining that extra_request_params are not forwarded to /infer since the endpoint only accepts question and include_metadata. Other params (model, provider, etc.) are not part of the RLSAPI /infer API contract.
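For reference, the documented construction would look roughly like this; the wrapper function is illustrative, since in the repo the dict is built inline in _rlsapi_infer_query:

```python
def build_infer_request(request_data: dict[str, object]) -> dict[str, object]:
    """Illustrative sketch of the /infer request body."""
    # Note: extra_request_params (model, provider, etc.) are intentionally not
    # forwarded; the RLSAPI /infer contract only accepts question and include_metadata.
    return {
        "question": request_data.pop("query"),
        "include_metadata": True,
    }
```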
SUPPORTED_ENDPOINT_TYPES = ["streaming", "query", "infer"]
DEFAULT_API_CACHE_DIR = ".caches/api_cache"

DEFAULT_API_NUM_RETRIES = 3
Inconsistent retry default: API retries still 3 while LLM retries bumped to 5.
The PR description states default retry attempts were increased from 3 to 5, but DEFAULT_API_NUM_RETRIES (line 62) is still 3 while only DEFAULT_LLM_RETRIES (line 73) was updated. Note that the API client's retry decorator uses APIConfig.num_retries (which defaults to DEFAULT_API_NUM_RETRIES), so the advertised retry bump does not actually reach the new 429/5xx retry path for API calls. Confirm whether the API default should also be raised to 5 for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lightspeed_evaluation/core/constants.py` around lines 59 - 62,
DEFAULT_API_NUM_RETRIES is still set to 3 while DEFAULT_LLM_RETRIES was bumped
to 5, so API calls using APIConfig.num_retries are not getting the intended
increased retry behavior; update DEFAULT_API_NUM_RETRIES to 5 to match
DEFAULT_LLM_RETRIES (or explicitly document/choose a different policy) so the
API client's retry decorator uses the new default number of retries.
Addressed — reverted DEFAULT_LLM_RETRIES back to 3. As asamal4 noted, this is the judge LLM default, not API retry. Actual values are configurable per-config.
asamal4 left a comment
Thank you!! I have added a few comments, plus there are some valid CodeRabbit comments. PTAL.
 DEFAULT_LLM_TEMPERATURE = 0.0
 DEFAULT_LLM_MAX_TOKENS = 512
-DEFAULT_LLM_RETRIES = 3
+DEFAULT_LLM_RETRIES = 5
This is for JudgeLLM, not for API call.. Also this is just the default value, actual value can be set in config as per the need. Do we really need to change this ?
Agreed, reverted to 3. Will configure via eval config when needed.
if not isinstance(exception, httpx.HTTPStatusError):
    return False
status = exception.response.status_code
return status == 429 or 500 <= status < 600
Instead of range, could you please write the exact status code like 502, 503..
I understand the purpose, but at the same time I am concerned that we will end up wasting retries.
In future we will probably make it configurable.
Done — narrowed to status in (429, 502, 503, 504). Only clearly transient errors: 429=rate limit, 502=bad gateway, 503=unavailable, 504=gateway timeout. 500 excluded since it can indicate permanent server bugs.
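For reference, the narrowed predicate is roughly the following (the helper name comes from the diff above; exact formatting in the repo may differ):

```python
import httpx


def _is_retryable_server_error(exception: BaseException) -> bool:
    """Retry only clearly transient HTTP errors."""
    if not isinstance(exception, httpx.HTTPStatusError):
        return False
    # 429 = rate limit, 502 = bad gateway, 503 = unavailable, 504 = gateway
    # timeout; 500 is excluded because it can indicate a permanent server bug.
    return exception.response.status_code in (429, 502, 503, 504)
```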
tool_call.get("tool_name")
or tool_call.get("name")
or ""
tool_name & name were used to make it compatible with older LCORE version. But is it the same scenario for RLSAPI ? can we use actual property name ?
Good point — RLSAPI uses its own native fields (name/args), not the legacy OLS tool_name/arguments format. Removed the fallback chain in _rlsapi_infer_query and now use tool_call.get("name", "") and tool_call.get("args", {}) directly. The legacy fallback in _standard_query is untouched since OLS may still use either format.
Looking at the whole flow, I think that this (overall API interaction/data processing) can be modularized further. Non-blocker for this PR.
# Remove skipped conversations
evaluation_data = [e for e in evaluation_data if not e.skip]

# Filter turn_metrics if --metrics was specified
Based on this logic, I see two limitations
- --metrics will only have turn level metric selection, not conversation level
- This will work only if there is a metric override from eval_data.yaml. But we can also set default metrics in system.yaml/config and not override from eval_data.
Both issues addressed:
- turn_metrics=None handling: Fixed the guard from if turn.turn_metrics: to if turn.turn_metrics is not None:. When turn_metrics is None (use system defaults) and --metrics is specified, we now materialize the system defaults (entries with default: true) and then filter — so downstream MetricManager sees a non-None filtered list.
- conversation_metrics filtering: Added. Same three-way handling: None → materialize defaults then filter, [] → leave as-is (skip all), [...] → filter the explicit list.
Added 3 tests covering all cases.
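A small sketch of the three-way handling, assuming a helper of this shape (illustrative; the actual change lives in DataValidator.load_evaluation_data and field names may differ):

```python
from typing import Optional


def filter_metric_list(
    configured: Optional[list[str]],
    system_defaults: list[str],
    cli_metrics: Optional[list[str]],
) -> Optional[list[str]]:
    """Three-way handling described above.

    None  -> materialize system defaults, then filter by --metrics
    []    -> leave as-is (explicitly run no metrics)
    [...] -> filter the explicit list by --metrics
    """
    if cli_metrics is None:          # no --metrics flag: leave config untouched
        return configured
    if configured is None:
        configured = list(system_defaults)
    if not configured:               # empty list means "skip all"; keep it
        return configured
    return [m for m in configured if m in cli_metrics]


# Example with --metrics custom:answer_correctness
defaults = ["ragas:faithfulness", "custom:answer_correctness"]
print(filter_metric_list(None, defaults, ["custom:answer_correctness"]))  # ['custom:answer_correctness']
print(filter_metric_list([], defaults, ["custom:answer_correctness"]))    # []
```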
Sorry I didn't mention that metric resolution is already handled.
Here we are adding 1. duplicate logic to process turn & conversation level logic, 2. Managing metric resolution from config/system. But this is already managed in MetricManager where we resolve the final turn and conversation level metrics. Moving the CLI filter there will eliminate the duplicate code significantly and will be more consistent.
This can be refactored later (follow up PR). I am considering this as non-blocker
- Revert DEFAULT_LLM_RETRIES from 5 to 3
- Narrow retry codes to (429, 502, 503, 504), exclude 500
- Use RLSAPI native fields (name/args) in _rlsapi_infer_query
- Fix RAG chunk accumulation across multiple mcp_call results
- Redact prompt from debug log, log only metadata
- Add comment about extra_request_params not forwarded to /infer
- Fix tool result capture: use content with status fallback
- Update endpoint_type description to include infer
- Move skip tests from TestFilterByScope to TestDataValidator
- Fix MockerFixture import in test_validator.py
- Fix --metrics filter: handle turn_metrics=None by materializing system defaults before filtering; add conversation_metrics filter
- Add metrics=None to runner test fixture for --metrics support
- Add tests for metrics filter materialization

Signed-off-by: Ellis Low <elow@redhat.com>
🧹 Nitpick comments (2)
tests/unit/runner/test_evaluation.py (1)
356-361: Consider adding coverage for a non-None metrics value.
Both updated assertions only verify metrics=None is forwarded. Since this PR introduces the --metrics filter as a new feature, a small parametrized case (or extra test) that exercises a non-None list (e.g. metrics=["custom:answer_correctness"]) would lock in the wiring from CLI → eval_args.metrics → DataValidator.load_evaluation_data(...). The test_main_with_filters parametrize at lines 875-885 would be a natural place to extend.
Also applies to: 412-417
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/runner/test_evaluation.py` around lines 356 - 361: Add a parametrized case to tests/unit/runner/test_evaluation.py (extend the existing test_main_with_filters parametrize) that passes a non-None metrics list through the CLI (e.g. --metrics "custom:answer_correctness") and assert that mock_validator.return_value.load_evaluation_data is called with metrics=["custom:answer_correctness"]; this ensures the CLI → eval_args.metrics → DataValidator.load_evaluation_data wiring is tested (refer to mock_validator.return_value.load_evaluation_data and eval_args.metrics in the test).

tests/unit/core/api/test_client.py (1)
692-724: Add 504 case to retryable predicate test.
_is_retryable_server_error retries (429, 502, 503, 504), but this test only exercises 502/503 from the 5xx set. Adding a 504 assertion closes the obvious coverage gap and guards against regressions if someone trims the tuple.
♻️ Proposed addition

 resp_503 = mocker.Mock(status_code=503)
 assert _is_retryable_server_error(
     httpx.HTTPStatusError("", request=mocker.Mock(), response=resp_503)
 )
+resp_504 = mocker.Mock(status_code=504)
+assert _is_retryable_server_error(
+    httpx.HTTPStatusError("", request=mocker.Mock(), response=resp_504)
+)
+
 resp_400 = mocker.Mock(status_code=400)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/core/api/test_client.py` around lines 692 - 724, The test test_is_retryable_server_error is missing coverage for the 504 status code; update that test to create a mock response with status_code=504 and assert that _is_retryable_server_error(httpx.HTTPStatusError("", request=mocker.Mock(), response=resp_504)) returns True so the tuple of retryable codes (429, 502, 503, 504) is fully exercised and prevents regressions in _is_retryable_server_error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/unit/core/api/test_client.py`:
- Around line 692-724: The test test_is_retryable_server_error is missing
coverage for the 504 status code; update that test to create a mock response
with status_code=504 and assert that
_is_retryable_server_error(httpx.HTTPStatusError("", request=mocker.Mock(),
response=resp_504)) returns True so the tuple of retryable codes (429, 502, 503,
504) is fully exercised and prevents regressions in _is_retryable_server_error.
In `@tests/unit/runner/test_evaluation.py`:
- Around line 356-361: Add a parametrized case to
tests/unit/runner/test_evaluation.py (extend the existing test_main_with_filters
parametrize) that passes a non-None metrics list through the CLI (e.g. --metrics
"custom:answer_correctness") and assert that
mock_validator.return_value.load_evaluation_data is called with
metrics=["custom:answer_correctness"]; this ensures the CLI → eval_args.metrics
→ DataValidator.load_evaluation_data wiring is tested (refer to
mock_validator.return_value.load_evaluation_data and eval_args.metrics in the
test).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2ec63bca-2b99-4992-bcaf-400570283420
📒 Files selected for processing (7)
- src/lightspeed_evaluation/core/api/client.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/system/validator.py
- tests/unit/core/api/test_client.py
- tests/unit/core/system/test_validator.py
- tests/unit/runner/test_evaluation.py
✅ Files skipped from review due to trivial changes (3)
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/api/client.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/lightspeed_evaluation/core/system/validator.py
- tests/unit/core/system/test_validator.py
asamal4 left a comment
LGTM !! Please fix the linting issue..
Non-blocking issues:
- Metrics filter: Duplicate logic for metric resolution.
- API client: there are now 3 separate processing paths; this can also be refactored further.
Summary
- --metrics CLI flag: Filter which metrics run during evaluation (e.g. --metrics custom:answer_correctness to skip expensive metrics)
- skip field for conversations: Add skip: true and skip_reason to eval YAML to disable conversations without deleting them
- /infer endpoint support: Add RLSAPI /v1/infer endpoint type with metadata extraction (tool_calls, rag_chunks, token counts)

Testing
- … test_validator.py unrelated to this PR)
- /infer endpoint tested live against local RHEL Lightspeed stack

Notes
These features were developed on a fork (emac-E/lightspeed-evaluation) and cherry-picked/ported to upstream's current main. The retry and infer changes were manually ported due to structural divergence in client.py.

Summary by CodeRabbit
New Features
Improvements