Added langfuse compatibility with the lightspeed-eval #223
bsatapat-jpg wants to merge 1 commit into lightspeed-core:main from
Conversation
Walkthrough
Adds optional Langfuse trace reporting via an on_complete callback: new EvaluationRunContext model, Langfuse reporter utilities, API/runner/pipeline plumbing to accept and invoke on_complete, CLI/env opt-in, README and dependency updates, and accompanying tests.
Sequence Diagram
sequenceDiagram
participant Client
participant API as evaluate()
participant Pipeline as EvaluationPipeline
participant Callback as on_complete
participant Langfuse as Langfuse Client
Client->>API: evaluate(..., evaluation_data_path, on_complete)
API->>Pipeline: run_evaluation(..., original_data_path, on_complete)
Pipeline->>Pipeline: perform evaluation -> results
Pipeline->>Callback: on_complete(results, EvaluationRunContext)
Callback->>Langfuse: init client & create trace (from run_name)
Callback->>Langfuse: emit score per metric
Callback->>Langfuse: flush()
Pipeline-->>API: return results
API-->>Client: return results
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 2
🧹 Nitpick comments (8)
src/lightspeed_evaluation/integrations/langfuse_reporter.py (2)
84-90: User context.metadata can silently overwrite reserved trace keys.
trace_meta.update(context.metadata or {}) runs after run_name, result_count, and original_data_path are set, so any of those keys passed by a caller in metadata will clobber the values the reporter is trying to guarantee. Inverting the merge (user first, reserved last) preserves the invariant.
Proposed fix
- trace_meta: dict[str, Any] = {
-     "run_name": context.run_name,
-     "result_count": len(results),
- }
- if context.original_data_path is not None:
-     trace_meta["original_data_path"] = context.original_data_path
- trace_meta.update(context.metadata or {})
+ trace_meta: dict[str, Any] = dict(context.metadata or {})
+ trace_meta["run_name"] = context.run_name
+ trace_meta["result_count"] = len(results)
+ if context.original_data_path is not None:
+     trace_meta["original_data_path"] = context.original_data_path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py` around lines 84 - 90, The merge order currently allows user-supplied context.metadata to overwrite reserved keys because trace_meta is populated first then trace_meta.update(context.metadata or {}); reverse the merge so you start with user metadata (e.g., initialize trace_meta = dict(context.metadata or {})), then set/override the reserved keys ("run_name", "result_count", and optionally "original_data_path") afterwards so the reporter's guaranteed values (in the trace_meta dict used by the Langfuse reporter) cannot be clobbered by context.metadata; ensure you still handle context.original_data_path None-check and preserve existing typing for trace_meta.
34-47: Docstring drift: comment format no longer includes explicit pass/fail wording.
The Args docstring says the comment contains "pass/fail status and a truncated reason", but _format_comment emits result=<r.result> (where r.result is one of PASS/FAIL/ERROR/SKIPPED). That's functionally equivalent but not what the docstring promises; a reader parsing Langfuse comments for a literal "pass"/"fail" substring will be misled for ERROR/SKIPPED rows. Suggest clarifying: "a result=<STATUS> field and optional truncated reason".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py` around lines 34 - 47, Update the docstring to reflect the actual comment format emitted by _format_comment: change the phrase that says the comment contains "pass/fail status and a truncated reason" to indicate it emits a "result=<STATUS>" field (where STATUS is PASS/FAIL/ERROR/SKIPPED) and an optional truncated reason; mention that PASS/FAIL correspond to pass/fail while ERROR/SKIPPED are reported as those literal statuses so downstream parsers should look for result=<STATUS> rather than the words "pass"/"fail".
tests/unit/test_api.py (1)
47-49: Assertions correctly distinguish the wrapper boundary (evaluation_data_path) from the pipeline boundary (original_data_path).
The naming difference is subtle but matches src/lightspeed_evaluation/api.py, where evaluate(...) forwards evaluation_data_path=... into pipeline.run_evaluation(original_data_path=...). Good coverage of the default (None) path.
One gap: there is no test covering a non-None on_complete being forwarded through evaluate/evaluate_conversation/evaluate_turn; worth adding a single assertion (e.g. on_complete=callback_sentinel, as sketched after the prompt below) to lock in the pass-through contract the Langfuse integration relies on.
Also applies to: 116-122, 135-141, 293-300
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_api.py` around lines 47 - 49, Add assertions to the existing unit tests to verify that a non-None on_complete callback is forwarded through the API wrapper into the pipeline; specifically, in tests that call evaluate, evaluate_conversation, and evaluate_turn (which currently assert mock_pipeline.run_evaluation.called with original_data_path=None), add a sentinel (e.g., callback_sentinel) passed as on_complete to the evaluate* call and assert mock_pipeline.run_evaluation.assert_called_with(..., on_complete=callback_sentinel) or the equivalent for run_conversation/run_turn; update the assertions at the spots mentioned (around lines 47, 116-122, 135-141, 293-300) to include the on_complete=callback_sentinel check so the tests lock in the pass-through behavior.
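A minimal sketch of that missing assertion. The patch target and the first positional argument to evaluate() are assumptions here, since this comment doesn't show how the existing tests construct the pipeline mock:

```python
from unittest.mock import MagicMock, patch

from lightspeed_evaluation import api


def test_evaluate_forwards_on_complete():
    """A non-None on_complete given to evaluate() should reach pipeline.run_evaluation unchanged."""
    callback_sentinel = MagicMock(name="on_complete_sentinel")

    # Patch target assumed; mirror however the surrounding tests already mock the pipeline.
    with patch("lightspeed_evaluation.api.EvaluationPipeline") as pipeline_cls:
        mock_pipeline = pipeline_cls.return_value
        mock_pipeline.run_evaluation.return_value = []

        api.evaluate(MagicMock(), on_complete=callback_sentinel)

    assert mock_pipeline.run_evaluation.call_args.kwargs["on_complete"] is callback_sentinel
```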
tests/unit/runner/test_evaluation.py (1)
52-64: Test updates mirror the runner signature change correctly.
One follow-up worth considering: there is no test in this file that exercises _make_eval_args(langfuse=True) end-to-end to assert that on_complete becomes non-None in the evaluate(...) call. The callback wiring is arguably the most error-prone part of the runner change; a single parametrized case here (mocking build_langfuse_on_complete_callback) would catch regressions cheaply.
Also applies to: 170-176, 231-237
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/runner/test_evaluation.py` around lines 52 - 64, Add a parametrized unit test that exercises _make_eval_args(langfuse=True) and verifies the runner wires the Langfuse callback into evaluate: mock build_langfuse_on_complete_callback to return a sentinel callback, call evaluate(...) with args from _make_eval_args(langfuse=True), and assert evaluate was invoked with on_complete equal to the sentinel (non-None). Locate uses of _make_eval_args, evaluate, and build_langfuse_on_complete_callback in this test module to add the new case (or extend the existing parametrized cases around lines where _make_eval_args is used) so regressions in callback wiring are caught.
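A sketch of the suggested case. The patch targets follow the module paths named in this comment, but the exact run_evaluation signature, the _make_eval_args helper, and whether the runner imports the callback factory lazily are taken on faith from the surrounding test file:

```python
from unittest.mock import MagicMock, patch

from lightspeed_evaluation.runner import evaluation as runner_module


def test_langfuse_flag_wires_on_complete():
    """langfuse=True should translate into a non-None on_complete for the evaluate() call."""
    sentinel = MagicMock(name="langfuse_on_complete")
    eval_args = _make_eval_args(langfuse=True)  # existing helper in this test module

    with patch(
        "lightspeed_evaluation.integrations.langfuse_reporter.build_langfuse_on_complete_callback",
        return_value=sentinel,
    ), patch("lightspeed_evaluation.runner.evaluation.evaluate") as mock_evaluate:
        runner_module.run_evaluation(eval_args)

    assert mock_evaluate.call_args.kwargs["on_complete"] is sentinel
```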
src/lightspeed_evaluation/integrations/__init__.py (1)
3-11: Re-exports look clean.
Note: because this package __init__.py unconditionally imports from langfuse_reporter, importing lightspeed_evaluation.integrations pulls in the reporter module. That's fine today because langfuse_reporter itself only lazy-imports langfuse inside the function body, but if anyone later moves the import langfuse to module scope, from lightspeed_evaluation.integrations import ... would start failing without the optional extra installed. Worth a short comment or a try/except around the re-export to make the "opt-in per extra" promise robust to future edits.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/integrations/__init__.py` around lines 3 - 11, The module-level import of build_langfuse_on_complete_callback and push_evaluation_results_to_langfuse makes importing lightspeed_evaluation.integrations risk raising if langfuse becomes a hard dependency; update src/lightspeed_evaluation/integrations/__init__.py to guard the re-exports by either wrapping the import of langfuse_reporter in a try/except ImportError (and only add the two symbols to __all__ when import succeeds) or add a clear comment explaining the deliberate lazy-import behavior so future edits preserve the optional-extra contract; reference the imported symbols build_langfuse_on_complete_callback and push_evaluation_results_to_langfuse and the module name langfuse_reporter when making the change.README.md (1)
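One shape the guarded re-export could take (a sketch only; the actual __init__.py contents are not shown in this comment):

```python
"""Integrations that depend on optional extras."""

__all__: list[str] = []

try:
    from .langfuse_reporter import (
        build_langfuse_on_complete_callback,
        push_evaluation_results_to_langfuse,
    )
except ImportError:
    # The langfuse extra is not installed; keep the package importable anyway.
    pass
else:
    __all__ += [
        "build_langfuse_on_complete_callback",
        "push_evaluation_results_to_langfuse",
    ]
```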
README.md (1)
494-496: Consider adding a version note for the langfuse extra.
The runtime module docstring pins to langfuse>=2,<3, but that constraint isn't surfaced here. A one-line note (e.g. "Requires Langfuse SDK v2") would prevent users from hitting the runtime failure described in the langfuse_reporter.py comment if they already have a v3 client installed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 494 - 496, Add a one-line note in the "Optional: Langfuse" README section stating the required Langfuse SDK version (e.g., "Requires Langfuse SDK v2: install with lightspeed-evaluation[langfuse] and langfuse>=2,<3") so users don't accidentally use v3; reference the runtime module pin (langfuse>=2,<3) and the build_langfuse_on_complete_callback symbol in lightspeed_evaluation.integrations.langfuse_reporter to align the doc with the actual runtime constraint.
tests/unit/integrations/test_langfuse_reporter.py (1)
51-78: Minor: caplog.set_level("ERROR") is set but never asserted on.
Both test_build_callback_invokes_push and test_push_with_mock_langfuse take the caplog fixture and set a level but never inspect caplog.records/caplog.text. Either drop the fixture to keep the tests focused, or add an assertion that no ERROR-level records were emitted on the success path; that would actually protect against regressions in the callback's "catch and log" branch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/integrations/test_langfuse_reporter.py` around lines 51 - 78, The tests test_build_callback_invokes_push and test_push_with_mock_langfuse set caplog.set_level("ERROR") but never assert against it; either remove the unused caplog fixture from those test signatures or add an explicit assertion (e.g., assert not any(r.levelname == "ERROR" for r in caplog.records) or assert "ERROR" not in caplog.text) after calling cb(...) / push_evaluation_results_to_langfuse(...) to ensure no ERROR-level logs were emitted; update the function signatures and references accordingly (test_build_callback_invokes_push, test_push_with_mock_langfuse, caplog).
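If the fixture is kept, the success-path assertion could be as small as this (sketch; the arrange/act portion stays as the existing tests already have it):

```python
import logging


def test_push_with_mock_langfuse(caplog):
    caplog.set_level(logging.ERROR)

    # ... existing setup and call to push_evaluation_results_to_langfuse(...) ...

    # The happy path must not trip the reporter's catch-and-log branch.
    assert not any(rec.levelno >= logging.ERROR for rec in caplog.records)
```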
pyproject.toml (1)
55-66: Heads-up: langfuse<3 pins the legacy v2 SDK.
The Langfuse Python SDK v3 (OTEL-based) has been GA since June 2025 and v4 is the current major on PyPI (langfuse==4.5.0). v2 is documented as the legacy SDK and its public API (Langfuse().trace().score()) is not forward-compatible; that shape is used throughout src/lightspeed_evaluation/integrations/langfuse_reporter.py and the tests. Pinning <3.0.0 is fine for now (and your inline comment signals intent), but track this as tech debt: a future v4 bump will require rewriting the reporter to use propagate_attributes/start_observation/score.create.
No action required in this PR unless you want to target v3/v4 directly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 55 - 66, The project currently pins langfuse<3 (legacy v2); if you later want to upgrade to the OTEL-based v3/v4 SDK, create a tech-debt ticket and update pyproject.toml to a v4 pin (e.g., "langfuse>=4.0.0,<5.0.0"), then refactor src/lightspeed_evaluation/integrations/langfuse_reporter.py to replace the v2 API calls (Langfuse().trace().score()) with the v3/v4 APIs (use propagate_attributes, start_observation and score.create) and update affected tests to the new API surface; if you prefer to keep v2 for now, add a clear TODO comment near the langfuse entry in pyproject.toml referencing the ticket so this migration is tracked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py`:
- Line 25: Replace lint suppressions by restructuring and typing: create a small
dataclass (e.g., LangfuseClientConfig) and accept that or **client_kwargs in
push_evaluation_results_to_langfuse to eliminate the too-many-arguments disable;
perform the optional import at module top-level (or use importlib.find_spec at
import time) and set a _HAS_LANGFUSE flag instead of importing inside the
function to remove import-outside-toplevel; add a TYPE_CHECKING-gated import for
langfuse and annotate the local variable as langfuse: "Langfuse" (or import its
types) to remove the # type: ignore comments; and replace the broad except with
the concrete Langfuse exception type (e.g., langfuse.errors.Error) or
requests.RequestException where appropriate to avoid pylint:
disable=broad-exception-caught. (A sketch of this restructuring follows after this list.)
- Around line 102-110: The current loop always sends a numeric 0.0 when
EvaluationResult.r.score is None, conflating missing/ERROR/SKIPPED with real
zeros; update the loop that builds name/_score_name and comment/_format_comment
so that if r.score is None you do not call trace.score(name=..., value=..., ...)
with a 0.0 — instead either skip calling trace.score for that item and record
the status explicitly (r.result) in the comment/metadata, or call trace with an
explicit non-numeric indication supported by Langfuse (e.g., omit value and add
status to comment/metadata) so ERROR and SKIPPED remain distinguishable from
genuine low numeric scores. (A sketch of this guard also follows after this list.)
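A rough sketch of the restructuring the first inline comment asks for; LangfuseClientConfig and its field names are proposals rather than existing symbols, and the rest of the reporter logic is elided:

```python
import logging
from dataclasses import dataclass
from typing import Optional

try:
    # Resolve the optional dependency once at import time instead of inside the function body.
    from langfuse import Langfuse
except ImportError:
    Langfuse = None

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class LangfuseClientConfig:
    """Groups client options so push_evaluation_results_to_langfuse keeps a small signature."""

    public_key: Optional[str] = None
    secret_key: Optional[str] = None
    host: Optional[str] = None


def push_evaluation_results_to_langfuse(results, context, config: LangfuseClientConfig = LangfuseClientConfig()):
    if Langfuse is None:
        logger.error("langfuse is not installed; skipping trace export")
        return
    client = Langfuse(public_key=config.public_key, secret_key=config.secret_key, host=config.host)
    ...  # trace / score / flush as in the current implementation
```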
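And a sketch of the second comment's guard, keeping the v2 trace.score(...) call shape this review assumes; _score_name and _format_comment are the helpers referenced above, and logger is assumed to exist in the module:

```python
for r in results:
    if r.score is None:
        # ERROR / SKIPPED rows: record the status instead of reporting a misleading 0.0.
        logger.info(
            "Skipping Langfuse score for %s (result=%s, no numeric score)",
            _score_name(r),
            r.result,
        )
        continue
    trace.score(
        name=_score_name(r),
        value=float(r.score),
        comment=_format_comment(r),
    )
```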
---
Nitpick comments:
In `@pyproject.toml`:
- Around line 55-66: The project currently pins langfuse<3 (legacy v2); if you
later want to upgrade to the OTEL-based v3/v4 SDK, create a tech-debt ticket and
update pyproject.toml to a v4 pin (e.g., "langfuse>=4.0.0,<5.0.0"), then
refactor src/lightspeed_evaluation/integrations/langfuse_reporter.py to replace
the v2 API calls (Langfuse().trace().score()) with the v3/v4 APIs (use
propagate_attributes, start_observation and score.create) and update affected
tests to the new API surface; if you prefer to keep v2 for now, add a clear TODO
comment near the langfuse entry in pyproject.toml referencing the ticket so this
migration is tracked.
In `@README.md`:
- Around line 494-496: Add a one-line note in the "Optional: Langfuse" README
section stating the required Langfuse SDK version (e.g., "Requires Langfuse SDK
v2: install with lightspeed-evaluation[langfuse] and langfuse>=2,<3") so users
don't accidentally use v3; reference the runtime module pin (langfuse>=2,<3) and
the build_langfuse_on_complete_callback symbol in
lightspeed_evaluation.integrations.langfuse_reporter to align the doc with the
actual runtime constraint.
In `@src/lightspeed_evaluation/integrations/__init__.py`:
- Around line 3-11: The module-level import of
build_langfuse_on_complete_callback and push_evaluation_results_to_langfuse
makes importing lightspeed_evaluation.integrations risk raising if langfuse
becomes a hard dependency; update
src/lightspeed_evaluation/integrations/__init__.py to guard the re-exports by
either wrapping the import of langfuse_reporter in a try/except ImportError (and
only add the two symbols to __all__ when import succeeds) or add a clear comment
explaining the deliberate lazy-import behavior so future edits preserve the
optional-extra contract; reference the imported symbols
build_langfuse_on_complete_callback and push_evaluation_results_to_langfuse and
the module name langfuse_reporter when making the change.
In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py`:
- Around line 84-90: The merge order currently allows user-supplied
context.metadata to overwrite reserved keys because trace_meta is populated
first then trace_meta.update(context.metadata or {}); reverse the merge so you
start with user metadata (e.g., initialize trace_meta = dict(context.metadata or
{})), then set/override the reserved keys ("run_name", "result_count", and
optionally "original_data_path") afterwards so the reporter's guaranteed values
(in the trace_meta dict used by the Langfuse reporter) cannot be clobbered by
context.metadata; ensure you still handle context.original_data_path None-check
and preserve existing typing for trace_meta.
- Around line 34-47: Update the docstring to reflect the actual comment format
emitted by _format_comment: change the phrase that says the comment contains
"pass/fail status and a truncated reason" to indicate it emits a
"result=<STATUS>" field (where STATUS is PASS/FAIL/ERROR/SKIPPED) and an
optional truncated reason; mention that PASS/FAIL correspond to pass/fail while
ERROR/SKIPPED are reported as those literal statuses so downstream parsers
should look for result=<STATUS> rather than the words "pass"/"fail".
In `@tests/unit/integrations/test_langfuse_reporter.py`:
- Around line 51-78: The tests test_build_callback_invokes_push and
test_push_with_mock_langfuse set caplog.set_level("ERROR") but never assert
against it; either remove the unused caplog fixture from those test signatures
or add an explicit assertion (e.g., assert not any(r.levelname == "ERROR" for r
in caplog.records) or assert "ERROR" not in caplog.text) after calling cb(...) /
push_evaluation_results_to_langfuse(...) to ensure no ERROR-level logs were
emitted; update the function signatures and references accordingly
(test_build_callback_invokes_push, test_push_with_mock_langfuse, caplog).
In `@tests/unit/runner/test_evaluation.py`:
- Around line 52-64: Add a parametrized unit test that exercises
_make_eval_args(langfuse=True) and verifies the runner wires the Langfuse
callback into evaluate: mock build_langfuse_on_complete_callback to return a
sentinel callback, call evaluate(...) with args from
_make_eval_args(langfuse=True), and assert evaluate was invoked with on_complete
equal to the sentinel (non-None). Locate uses of _make_eval_args, evaluate, and
build_langfuse_on_complete_callback in this test module to add the new case (or
extend the existing parametrized cases around lines where _make_eval_args is
used) so regressions in callback wiring are caught.
In `@tests/unit/test_api.py`:
- Around line 47-49: Add assertions to the existing unit tests to verify that a
non-None on_complete callback is forwarded through the API wrapper into the
pipeline; specifically, in tests that call evaluate, evaluate_conversation, and
evaluate_turn (which currently assert mock_pipeline.run_evaluation.called with
original_data_path=None), add a sentinel (e.g., callback_sentinel) passed as
on_complete to the evaluate* call and assert
mock_pipeline.run_evaluation.assert_called_with(...,
on_complete=callback_sentinel) or the equivalent for run_conversation/run_turn;
update the assertions at the spots mentioned (around lines 47, 116-122, 135-141,
293-300) to include the on_complete=callback_sentinel check so the tests lock in
the pass-through behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aa8520d4-455b-4d02-90d7-970ce2235866
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
- README.md
- pyproject.toml
- src/lightspeed_evaluation/__init__.py
- src/lightspeed_evaluation/api.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/evaluation_run_context.py
- src/lightspeed_evaluation/integrations/__init__.py
- src/lightspeed_evaluation/integrations/langfuse_reporter.py
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
- src/lightspeed_evaluation/runner/evaluation.py
- tests/unit/integrations/test_langfuse_reporter.py
- tests/unit/pipeline/evaluation/test_pipeline.py
- tests/unit/runner/test_evaluation.py
- tests/unit/test_api.py
cac2b29 to 2a8a136
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/runner/test_evaluation.py (1)
52-64: ⚠️ Potential issue | 🟡 Minor
Missing test coverage for the --langfuse/LIGHTSPEED_USE_LANGFUSE opt-in.
_make_eval_args now takes langfuse=False, and the two assert_called_once_with updates confirm the default path passes on_complete=None. However, there's no test exercising the enabled path:
- run_evaluation with langfuse=True should import build_langfuse_on_complete_callback and pass a non-None callback into lightspeed_evaluation.api.evaluate.
- Same with LIGHTSPEED_USE_LANGFUSE=1 (and the accepted true/yes variants) in the environment.
- main() should populate argparse.Namespace.langfuse from the new --langfuse flag.
Given this is the only public entry point for the new feature, at minimum a test that patches lightspeed_evaluation.integrations.langfuse_reporter.build_langfuse_on_complete_callback and asserts the returned callable is forwarded would be worthwhile.
Want me to draft these test cases?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/runner/test_evaluation.py` around lines 52 - 64, Tests are missing for the langfuse opt-in path: add unit tests that call run_evaluation (and main via invoking the CLI entry point) with langfuse=True and with LIGHTSPEED_USE_LANGFUSE set (e.g., "1"/"true"/"yes") that patch lightspeed_evaluation.integrations.langfuse_reporter.build_langfuse_on_complete_callback to return a sentinel callable and patch lightspeed_evaluation.api.evaluate to capture its on_complete arg; assert evaluate was called with on_complete is the sentinel (non-None) when langfuse is enabled and assert _make_eval_args and main populate argparse.Namespace.langfuse correctly from the --langfuse flag and the environment variable.
🧹 Nitpick comments (5)
src/lightspeed_evaluation/api.py (1)
41-49: Coding-guideline violation: # pylint: disable=too-many-arguments across public wrappers.
Per repo guidelines for **/*.py, lint suppressions should be replaced by fixing the underlying issue rather than silencing it. All six wrappers (evaluate, evaluate_with_summary, evaluate_conversation, evaluate_conversation_with_summary, evaluate_turn, evaluate_turn_with_summary) now carry this disable. Since the wrappers differ only in their payload shape, a small options object is a clean fix that removes every disable at once.

# In core.models (or alongside EvaluationRunContext)
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class EvaluateOptions:
    output_dir: Optional[str] = None
    evaluation_data_path: Optional[str] = None
    on_complete: Optional[Callable[[list[EvaluationResult], EvaluationRunContext], None]] = None
    compute_confidence_intervals: bool = False

Wrappers would then take options: EvaluateOptions = EvaluateOptions() and forward a single object, dropping every # pylint: disable.
As per coding guidelines: "Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead".
Also applies to Lines 86, 129, 162, 199, 245.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/api.py` around lines 41 - 49, Replace the multiple-argument public wrapper signatures that use "# pylint: disable=too-many-arguments" (evaluate, evaluate_with_summary, evaluate_conversation, evaluate_conversation_with_summary, evaluate_turn, evaluate_turn_with_summary) with a single options object parameter: define an immutable dataclass EvaluateOptions (with fields output_dir, evaluation_data_path, on_complete, compute_confidence_intervals, etc.) and change each wrapper to accept options: EvaluateOptions = EvaluateOptions() and forward options to the underlying implementation, removing the lint disables; update all call sites in these wrappers to read options. […]
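For illustration, a call site under that refactor might look like this. EvaluateOptions is the hypothetical dataclass proposed above, the top-level evaluate export and the zero-argument build_langfuse_on_complete_callback() call are assumptions, and the data values are placeholders:

```python
from lightspeed_evaluation import evaluate
from lightspeed_evaluation.integrations import build_langfuse_on_complete_callback

options = EvaluateOptions(                      # hypothetical options object from core.models
    output_dir="eval_output",                   # illustrative values
    evaluation_data_path="data/eval_data.yaml",
    on_complete=build_langfuse_on_complete_callback(),
)
results = evaluate(eval_data, options)          # eval_data: whatever payload evaluate() already takes
```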
src/lightspeed_evaluation/integrations/langfuse_reporter.py (1)
139-155: User-supplied context.metadata silently clobbers internal trace keys.
trace_meta.update(context.metadata or {}) is performed after setting run_name, result_count, and original_data_path, so a caller who happens to use any of those keys in context.metadata will overwrite the framework-set values on the Langfuse trace. If that is intentional (callers should be able to override), the code is fine; otherwise consider reversing the precedence or namespacing user metadata.
♻️ Suggested precedence reversal
- trace_meta: dict[str, Any] = {
-     "run_name": context.run_name,
-     "result_count": len(results),
- }
- if context.original_data_path is not None:
-     trace_meta["original_data_path"] = context.original_data_path
- trace_meta.update(context.metadata or {})
+ trace_meta: dict[str, Any] = dict(context.metadata or {})
+ trace_meta["run_name"] = context.run_name
+ trace_meta["result_count"] = len(results)
+ if context.original_data_path is not None:
+     trace_meta["original_data_path"] = context.original_data_path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py` around lines 139 - 155, The trace metadata merge currently lets user-supplied context.metadata overwrite framework keys because trace_meta.update(context.metadata or {}) runs after setting run_name/result_count/original_data_path; change the precedence so framework keys win by applying user metadata first and then updating/overwriting with the framework values (i.e., build trace_meta from context.metadata or {} then set trace_meta["run_name"]=context.run_name, trace_meta["result_count"]=len(results), and optionally set original_data_path if present) so that the values used when creating the langfuse.trace (see trace_meta and trace = langfuse.trace(...)) cannot be clobbered by callers.
tests/unit/pipeline/evaluation/test_pipeline.py (1)
163-260: Good coverage; consider one more edge case.
The two tests cover the happy-path invocation and exception isolation well. Consider a third that calls run_evaluation without original_data_path to lock in that EvaluationRunContext.run_name falls back to "evaluation" and original_data_path is None; this cements the fallback contract that Langfuse traces will rely on.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/pipeline/evaluation/test_pipeline.py` around lines 163 - 260, Add a new unit test (e.g., test_on_complete_defaults_run_name_and_original_data_path_none) that mirrors the existing on_complete tests but calls EvaluationPipeline.run_evaluation WITHOUT providing original_data_path; patch the same collaborators (MetricManager, APIDataAmender, EvaluationErrorHandler, ScriptExecutionManager, MetricsEvaluator, and ConversationProcessor returning a mock processor that yields a mock EvaluationResult), register an on_complete callback that captures the (results, ctx) pair, invoke pipeline.run_evaluation(sample_evaluation_data, on_complete=on_complete) and assert the callback ran and that ctx.run_name == "evaluation" and ctx.original_data_path is None so the fallback behavior is verified.
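A compact sketch of that third case; the pipeline construction and collaborator patches are elided behind a hypothetical helper because they would simply mirror the two existing on_complete tests:

```python
def test_on_complete_defaults_without_original_data_path(sample_evaluation_data):
    captured = {}

    def on_complete(results, ctx):
        captured["results"] = results
        captured["ctx"] = ctx

    pipeline = _build_patched_pipeline()  # hypothetical helper: same patching as the existing tests
    pipeline.run_evaluation(sample_evaluation_data, on_complete=on_complete)

    ctx = captured["ctx"]
    assert ctx.run_name == "evaluation"        # fallback when no data path is supplied
    assert ctx.original_data_path is None
```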
src/lightspeed_evaluation/runner/evaluation.py (1)
164-178: Minor: accepted truthy values don't match common conventions.
("1", "true", "yes") covers the help text's example but omits on, which some users expect (and is accepted by e.g. distutils.util.strtobool). Consider also handling on/y/t for symmetry, or document the exact accepted set in the --langfuse help.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lightspeed_evaluation/runner/evaluation.py` around lines 164 - 178, The env/CLI truthiness check for enabling Langfuse currently tests only ("1","true","yes"); update the logic around use_langfuse (which uses eval_args.langfuse and LIGHTSPEED_USE_LANGFUSE) to accept additional common truthy tokens such as "on", "y", and "t" (or use a standard parser like distutils.util.strtobool with proper ValueError handling) before calling build_langfuse_on_complete_callback; ensure the normalized string comparison is case-insensitive and applied in the same place where use_langfuse is computed so build_langfuse_on_complete_callback() behavior remains unchanged.
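A small helper along the lines suggested (names follow the comment above; distutils is deliberately avoided since distutils.util is gone from the standard library as of Python 3.12):

```python
import os

_TRUTHY = {"1", "true", "yes", "on", "y", "t"}


def _env_flag_enabled(raw: str | None) -> bool:
    """Case-insensitive truthiness check for values like LIGHTSPEED_USE_LANGFUSE."""
    return raw is not None and raw.strip().lower() in _TRUTHY


# eval_args comes from the parsed CLI namespace in the runner.
use_langfuse = eval_args.langfuse or _env_flag_enabled(os.environ.get("LIGHTSPEED_USE_LANGFUSE"))
```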
requirements-all-extras.txt (1)
223-224: Consider targeting Langfuse v4.
langfuse==2.60.10 pins to a deprecated v2 release; Langfuse v2 documentation was removed in August 2025, marking its end of life. Langfuse v3 (released June 2025) introduced OpenTelemetry-based observability; however, v4.x (released March 2026) is now current with an observation-centric data model and is the recommended upgrade path. Since this integration is new, target v4 directly (new extra langfuse>=4). If v2 is retained deliberately for compatibility reasons, add a note in the reporter module docstring explaining the rationale and include a tracking issue to migrate.
This is auto-generated from pyproject.toml, so the real change belongs there.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@requirements-all-extras.txt` around lines 223 - 224, The project is pinned to langfuse==2.60.10 (v2); update the dependency to target the current major release (langfuse>=4) in the source of truth (pyproject.toml) so the autogenerated requirements-all-extras.txt will reflect langfuse>=4, or if you must keep v2 for compatibility, add a clear explanatory note and migration tracking issue in the reporter module docstring (referencing the reporter module name) that states why v2 is retained and links the issue to migrate to langfuse v4.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py`:
- Around line 49-81: The docstring promises Langfuse client errors will be
logged and not re-raised, but only client creation is currently protected; wrap
the trace/score/flush interaction so exceptions are caught and logged instead of
propagated. Specifically, in push_evaluation_results_to_langfuse (or inside
_write_langfuse_trace_and_scores) add a try/except around calls that invoke
client.trace(...), trace.score(...), and client.flush() (or any langfuse API
calls) and on exception call logger.error with contextual info and return
gracefully; keep the client creation logic as-is and do not re-raise the caught
exceptions so the implementation matches the docstring. (A sketch of this error isolation follows after this list.)
- Around line 23-26: The code is calling Langfuse v1-style APIs
(client.trace(...), trace.score(...), client.flush()) which do not exist in
langfuse>=2.0.0; update the implementation in langfuse_reporter.py to use the v2
observation API: replace client.trace(name=..., metadata=..., tags=...) with a
context created via client.start_as_current_observation(name=..., metadata=...,
tags=...) (or the recommended observation factory), switch trace.score(name=...,
value=..., comment=...) to client.create_score(...) or use the returned
observation/span's score method per v2 (e.g., span.score_trace or
client.create_score with the observation id), and remove calls to client.flush()
(v2 has no explicit flush). Also ensure the Langfuse import/optional check still
works and update any variable names (e.g., trace -> observation/span) to match
the v2 objects so tests and runtime use the correct v2 method signatures.
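A sketch of the error-isolation shape described in the first comment of this list, using the client.trace / trace.score / client.flush calls this review refers to; the helper name, parameters, and the deliberately broad except are assumptions, and the earlier comment about exception types would narrow the catch:

```python
def _write_langfuse_trace_and_scores(client, results, context, trace_meta, tags):
    """Push one trace plus per-metric scores; never let reporting failures propagate."""
    try:
        trace = client.trace(name=context.run_name, metadata=trace_meta, tags=tags)
        for r in results:
            if r.score is None:
                continue  # non-numeric rows handled separately (see the earlier comment)
            trace.score(name=_score_name(r), value=float(r.score), comment=_format_comment(r))
        client.flush()
    except Exception:  # intentionally broad in this sketch: reporting must not fail the run
        logger.exception("Failed to push evaluation results to Langfuse (run=%s)", context.run_name)
```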
---
Outside diff comments:
In `@tests/unit/runner/test_evaluation.py`:
- Around line 52-64: Tests are missing for the langfuse opt-in path: add unit
tests that call run_evaluation (and main via invoking the CLI entry point) with
langfuse=True and with LIGHTSPEED_USE_LANGFUSE set (e.g., "1"/"true"/"yes") that
patch
lightspeed_evaluation.integrations.langfuse_reporter.build_langfuse_on_complete_callback
to return a sentinel callable and patch lightspeed_evaluation.api.evaluate to
capture its on_complete arg; assert evaluate was called with on_complete is the
sentinel (non-None) when langfuse is enabled and assert _make_eval_args and main
populate argparse.Namespace.langfuse correctly from the --langfuse flag and the
environment variable.
---
Nitpick comments:
In `@requirements-all-extras.txt`:
- Around line 223-224: The project is pinned to langfuse==2.60.10 (v2); update
the dependency to target the current major release (langfuse>=4) in the source
of truth (pyproject.toml) so the autogenerated requirements-all-extras.txt will
reflect langfuse>=4, or if you must keep v2 for compatibility, add a clear
explanatory note and migration tracking issue in the reporter module docstring
(referencing the reporter module name) that states why v2 is retained and links
the issue to migrate to langfuse v4.
In `@src/lightspeed_evaluation/api.py`:
- Around line 41-49: Replace the multiple-argument public wrapper signatures
that use "# pylint: disable=too-many-arguments" (evaluate,
evaluate_with_summary, evaluate_conversation,
evaluate_conversation_with_summary, evaluate_turn, evaluate_turn_with_summary)
with a single options object parameter: define an immutable dataclass
EvaluateOptions (with fields output_dir, evaluation_data_path, on_complete,
compute_confidence_intervals, etc.) and change each wrapper to accept options:
EvaluateOptions = EvaluateOptions() and forward options to the underlying
implementation, removing the lint disables; update all call sites in these
wrappers to read options. […]
In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py`:
- Around line 139-155: The trace metadata merge currently lets user-supplied
context.metadata overwrite framework keys because
trace_meta.update(context.metadata or {}) runs after setting
run_name/result_count/original_data_path; change the precedence so framework
keys win by applying user metadata first and then updating/overwriting with the
framework values (i.e., build trace_meta from context.metadata or {} then set
trace_meta["run_name"]=context.run_name,
trace_meta["result_count"]=len(results), and optionally set original_data_path
if present) so that the values used when creating the langfuse.trace (see
trace_meta and trace = langfuse.trace(...)) cannot be clobbered by callers.
In `@src/lightspeed_evaluation/runner/evaluation.py`:
- Around line 164-178: The env/CLI truthiness check for enabling Langfuse
currently tests only ("1","true","yes"); update the logic around use_langfuse
(which uses eval_args.langfuse and LIGHTSPEED_USE_LANGFUSE) to accept additional
common truthy tokens such as "on", "y", and "t" (or use a standard parser like
distutils.util.strtobool with proper ValueError handling) before calling
build_langfuse_on_complete_callback; ensure the normalized string comparison is
case-insensitive and applied in the same place where use_langfuse is computed so
build_langfuse_on_complete_callback() behavior remains unchanged.
In `@tests/unit/pipeline/evaluation/test_pipeline.py`:
- Around line 163-260: Add a new unit test (e.g.,
test_on_complete_defaults_run_name_and_original_data_path_none) that mirrors the
existing on_complete tests but calls EvaluationPipeline.run_evaluation WITHOUT
providing original_data_path; patch the same collaborators (MetricManager,
APIDataAmender, EvaluationErrorHandler, ScriptExecutionManager,
MetricsEvaluator, and ConversationProcessor returning a mock processor that
yields a mock EvaluationResult), register an on_complete callback that captures
the (results, ctx) pair, invoke pipeline.run_evaluation(sample_evaluation_data,
on_complete=on_complete) and assert the callback ran and that ctx.run_name ==
"evaluation" and ctx.original_data_path is None so the fallback behavior is
verified.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 52c593a5-6762-4b3f-8cca-c7df85afe90f
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
- README.md
- pyproject.toml
- requirements-all-extras.txt
- requirements-local-embeddings.txt
- requirements-nlp-metrics.txt
- requirements.txt
- src/lightspeed_evaluation/__init__.py
- src/lightspeed_evaluation/api.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/evaluation_run_context.py
- src/lightspeed_evaluation/integrations/__init__.py
- src/lightspeed_evaluation/integrations/langfuse_reporter.py
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
- src/lightspeed_evaluation/runner/evaluation.py
- tests/unit/integrations/test_langfuse_reporter.py
- tests/unit/pipeline/evaluation/test_pipeline.py
- tests/unit/runner/test_evaluation.py
- tests/unit/test_api.py
✅ Files skipped from review due to trivial changes (9)
- requirements-local-embeddings.txt
- requirements.txt
- requirements-nlp-metrics.txt
- src/lightspeed_evaluation/core/models/evaluation_run_context.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/integrations/__init__.py
- README.md
- pyproject.toml
- tests/unit/integrations/test_langfuse_reporter.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/lightspeed_evaluation/__init__.py
- tests/unit/test_api.py
@coderabbitai review
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Documentation
Chores
Tests