
Added langfuse compatibility with the lightspeed-eval #223

Open
bsatapat-jpg wants to merge 1 commit into lightspeed-core:main from bsatapat-jpg:dev

Conversation

Collaborator

@bsatapat-jpg bsatapat-jpg commented Apr 24, 2026

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Codex-5.0
  • Generated by: Cursor

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Optional Langfuse integration to emit evaluation traces with per‑metric scores; new CLI flag and env var to enable it.
    • Evaluation APIs now accept evaluation_data_path and an on_complete callback; EvaluationRunContext exposed for run metadata.
  • Documentation

    • README documents Langfuse setup, credentials, activation, and usage pattern.
  • Chores

    • Added optional langfuse extra and updated pinned dependency versions.
  • Tests

    • Added tests for Langfuse reporter and callback behavior.

Contributor

coderabbitai Bot commented Apr 24, 2026

Warning

Rate limit exceeded

@bsatapat-jpg has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 51 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 10 minutes and 51 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: df322c13-41d1-4b68-a5f9-53999650f99a

📥 Commits

Reviewing files that changed from the base of the PR and between 2a8a136 and ffd22e3.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
  • README.md
  • pyproject.toml
  • requirements-all-extras.txt
  • requirements-local-embeddings.txt
  • requirements-nlp-metrics.txt
  • requirements.txt
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/api.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/integrations/__init__.py
  • src/lightspeed_evaluation/integrations/langfuse_reporter.py
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/integrations/test_langfuse_reporter.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/test_api.py

Walkthrough

Adds optional Langfuse trace reporting via an on_complete callback: new EvaluationRunContext model, Langfuse reporter utilities, API/runner/pipeline plumbing to accept and invoke on_complete, CLI/env opt-in, README and dependency updates, and accompanying tests.

Changes

  • Documentation & Packaging (README.md, pyproject.toml, requirements*.txt, requirements-all-extras.txt): Adds README docs for the Langfuse integration, introduces the langfuse optional dependency (v2 range), pins Langfuse in the exported requirements, and updates packaging/dev dependencies and several requirements pins.
  • Core Models & Exports (src/lightspeed_evaluation/core/models/evaluation_run_context.py, src/lightspeed_evaluation/core/models/__init__.py, src/lightspeed_evaluation/__init__.py): Adds an immutable EvaluationRunContext dataclass and re-exports it from the models package and the package-level lazy imports.
  • Public API Wrappers (src/lightspeed_evaluation/api.py): Adds keyword-only evaluation_data_path and on_complete parameters to the evaluation wrapper functions and updates typings/docstrings to describe the callback (results, EvaluationRunContext).
  • Pipeline & Runner (src/lightspeed_evaluation/pipeline/evaluation/pipeline.py, src/lightspeed_evaluation/runner/evaluation.py): EvaluationPipeline.run_evaluation gains an on_complete callback, builds an EvaluationRunContext, and invokes the callback (exceptions are logged). The runner adds a --langfuse CLI flag and env-var support, and passes evaluation_data_path/on_complete into evaluate().
  • Langfuse Integration (src/lightspeed_evaluation/integrations/__init__.py, src/lightspeed_evaluation/integrations/langfuse_reporter.py): New integrations package exposing build_langfuse_on_complete_callback and push_evaluation_results_to_langfuse; the reporter conditionally imports Langfuse, normalizes config, creates one trace per run, emits per-metric scores, and logs failures safely.
  • Tests (tests/unit/integrations/test_langfuse_reporter.py, tests/unit/pipeline/evaluation/test_pipeline.py, tests/unit/runner/test_evaluation.py, tests/unit/test_api.py): Adds unit tests for the Langfuse reporter (mocked SDK) and updates pipeline/runner/api tests to assert callback wiring, evaluation_data_path propagation, and callback exception tolerance.

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as evaluate()
    participant Pipeline as EvaluationPipeline
    participant Callback as on_complete
    participant Langfuse as Langfuse Client

    Client->>API: evaluate(..., evaluation_data_path, on_complete)
    API->>Pipeline: run_evaluation(..., original_data_path, on_complete)
    Pipeline->>Pipeline: perform evaluation -> results
    Pipeline->>Callback: on_complete(results, EvaluationRunContext)
    Callback->>Langfuse: init client & create trace (from run_name)
    Callback->>Langfuse: emit score per metric
    Callback->>Langfuse: flush()
    Pipeline-->>API: return results
    API-->>Client: return results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

🚥 Pre-merge checks: ✅ 5 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Added langfuse compatibility with the lightspeed-eval' accurately describes the main change: adding Langfuse integration support to the lightspeed-evaluation package.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 85.37%, which meets the required threshold of 80.00%.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (8)
src/lightspeed_evaluation/integrations/langfuse_reporter.py (2)

84-90: User context.metadata can silently overwrite reserved trace keys.

trace_meta.update(context.metadata or {}) runs after run_name, result_count, and original_data_path are set, so any of those keys passed by a caller in metadata will clobber the values the reporter is trying to guarantee. Inverting the merge (user first, reserved last) preserves the invariant.

Proposed fix
-    trace_meta: dict[str, Any] = {
-        "run_name": context.run_name,
-        "result_count": len(results),
-    }
-    if context.original_data_path is not None:
-        trace_meta["original_data_path"] = context.original_data_path
-    trace_meta.update(context.metadata or {})
+    trace_meta: dict[str, Any] = dict(context.metadata or {})
+    trace_meta["run_name"] = context.run_name
+    trace_meta["result_count"] = len(results)
+    if context.original_data_path is not None:
+        trace_meta["original_data_path"] = context.original_data_path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py` around lines 84
- 90, The merge order currently allows user-supplied context.metadata to
overwrite reserved keys because trace_meta is populated first then
trace_meta.update(context.metadata or {}); reverse the merge so you start with
user metadata (e.g., initialize trace_meta = dict(context.metadata or {})), then
set/override the reserved keys ("run_name", "result_count", and optionally
"original_data_path") afterwards so the reporter's guaranteed values (in the
trace_meta dict used by the Langfuse reporter) cannot be clobbered by
context.metadata; ensure you still handle context.original_data_path None-check
and preserve existing typing for trace_meta.
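The precedence rule the comment asks for can be sketched in isolation. The function below is an illustrative stand-in, not the reporter's actual code: user metadata seeds the dict, then the reserved keys are written last so they always win.

```python
from typing import Any, Optional


def build_trace_meta(
    run_name: str,
    result_count: int,
    user_metadata: Optional[dict[str, Any]],
    original_data_path: Optional[str] = None,
) -> dict[str, Any]:
    # Start from the caller-supplied metadata ...
    trace_meta: dict[str, Any] = dict(user_metadata or {})
    # ... then set the reserved keys afterwards so they cannot be clobbered.
    trace_meta["run_name"] = run_name
    trace_meta["result_count"] = result_count
    if original_data_path is not None:
        trace_meta["original_data_path"] = original_data_path
    return trace_meta
```

With this ordering, a caller passing `{"run_name": "spoofed"}` in metadata still ends up with the framework-set run name on the trace.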

34-47: Docstring drift: comment format no longer includes explicit pass/fail wording.

The Args docstring says the comment contains "pass/fail status and a truncated reason", but _format_comment emits result=<r.result> (where r.result is one of PASS/FAIL/ERROR/SKIPPED). That's functionally equivalent but not what the docstring promises — a reader parsing Langfuse comments for a literal "pass"/"fail" substring will be misled for ERROR/SKIPPED rows. Suggest clarifying: "a result=<STATUS> field and optional truncated reason".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py` around lines 34
- 47, Update the docstring to reflect the actual comment format emitted by
_format_comment: change the phrase that says the comment contains "pass/fail
status and a truncated reason" to indicate it emits a "result=<STATUS>" field
(where STATUS is PASS/FAIL/ERROR/SKIPPED) and an optional truncated reason;
mention that PASS/FAIL correspond to pass/fail while ERROR/SKIPPED are reported
as those literal statuses so downstream parsers should look for result=<STATUS>
rather than the words "pass"/"fail".
tests/unit/test_api.py (1)

47-49: Assertions correctly distinguish the wrapper boundary (evaluation_data_path) from the pipeline boundary (original_data_path).

The naming difference is subtle but matches src/lightspeed_evaluation/api.py where evaluate(...) forwards evaluation_data_path=... into pipeline.run_evaluation(original_data_path=...). Good coverage of the default (None) path.

One gap: there is no test covering a non-None on_complete being forwarded through evaluate/evaluate_conversation/evaluate_turn — worth adding a single assertion (e.g. on_complete=callback_sentinel) to lock in the pass-through contract the Langfuse integration relies on.

Also applies to: 116-122, 135-141, 293-300

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_api.py` around lines 47 - 49, Add assertions to the existing
unit tests to verify that a non-None on_complete callback is forwarded through
the API wrapper into the pipeline; specifically, in tests that call evaluate,
evaluate_conversation, and evaluate_turn (which currently assert
mock_pipeline.run_evaluation.called with original_data_path=None), add a
sentinel (e.g., callback_sentinel) passed as on_complete to the evaluate* call
and assert mock_pipeline.run_evaluation.assert_called_with(...,
on_complete=callback_sentinel) or the equivalent for run_conversation/run_turn;
update the assertions at the spots mentioned (around lines 47, 116-122, 135-141,
293-300) to include the on_complete=callback_sentinel check so the tests lock in
the pass-through behavior.
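The suggested sentinel assertion can be sketched against a stand-in wrapper (the real `evaluate` in api.py has more parameters; only the forwarding contract is modeled here):

```python
from unittest.mock import MagicMock


def evaluate(pipeline, *, evaluation_data_path=None, on_complete=None):
    """Stand-in wrapper: forwards keyword-only args into the pipeline,
    renaming evaluation_data_path to original_data_path at the boundary."""
    return pipeline.run_evaluation(
        original_data_path=evaluation_data_path,
        on_complete=on_complete,
    )


def test_on_complete_is_forwarded():
    mock_pipeline = MagicMock()
    callback_sentinel = object()
    evaluate(mock_pipeline, on_complete=callback_sentinel)
    # Lock in the pass-through contract the Langfuse integration relies on.
    mock_pipeline.run_evaluation.assert_called_once_with(
        original_data_path=None,
        on_complete=callback_sentinel,
    )
```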
tests/unit/runner/test_evaluation.py (1)

52-64: Test updates mirror the runner signature change correctly.

One follow-up worth considering: there is no test in this file that exercises _make_eval_args(langfuse=True) end-to-end to assert that on_complete becomes non-None in the evaluate(...) call. The callback wiring is arguably the most error-prone part of the runner change; a single parametrized case here (mocking build_langfuse_on_complete_callback) would catch regressions cheaply.

Also applies to: 170-176, 231-237

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/runner/test_evaluation.py` around lines 52 - 64, Add a
parametrized unit test that exercises _make_eval_args(langfuse=True) and
verifies the runner wires the Langfuse callback into evaluate: mock
build_langfuse_on_complete_callback to return a sentinel callback, call
evaluate(...) with args from _make_eval_args(langfuse=True), and assert evaluate
was invoked with on_complete equal to the sentinel (non-None). Locate uses of
_make_eval_args, evaluate, and build_langfuse_on_complete_callback in this test
module to add the new case (or extend the existing parametrized cases around
lines where _make_eval_args is used) so regressions in callback wiring are
caught.
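The wiring check suggested above reduces to: when langfuse is enabled, the builder's return value must reach `evaluate` as `on_complete`. A self-contained sketch (the runner function and its arguments here are illustrative stand-ins, not the real runner signature):

```python
from unittest.mock import MagicMock


def run_evaluation(args: dict, *, evaluate, build_callback):
    """Stand-in runner: builds the Langfuse callback only when opted in."""
    on_complete = build_callback() if args.get("langfuse") else None
    return evaluate(on_complete=on_complete)


# Wiring check: the sentinel returned by the (mocked) builder must be the
# exact object passed to evaluate().
sentinel = object()
evaluate_mock = MagicMock()
run_evaluation({"langfuse": True}, evaluate=evaluate_mock, build_callback=lambda: sentinel)
```

In the real test module the builder would be patched with `unittest.mock.patch` rather than passed as a parameter.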
src/lightspeed_evaluation/integrations/__init__.py (1)

3-11: Re-exports look clean.

Note: because this package __init__.py unconditionally imports from langfuse_reporter, importing lightspeed_evaluation.integrations pulls in the reporter module. That's fine today because langfuse_reporter itself only lazy-imports langfuse inside the function body — but if anyone later moves the import langfuse to module scope, from lightspeed_evaluation.integrations import ... would start failing without the optional extra installed. Worth a short comment or a try/except around the re-export to make the "opt-in per extra" promise robust to future edits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/integrations/__init__.py` around lines 3 - 11, The
module-level import of build_langfuse_on_complete_callback and
push_evaluation_results_to_langfuse makes importing
lightspeed_evaluation.integrations risk raising if langfuse becomes a hard
dependency; update src/lightspeed_evaluation/integrations/__init__.py to guard
the re-exports by either wrapping the import of langfuse_reporter in a
try/except ImportError (and only add the two symbols to __all__ when import
succeeds) or add a clear comment explaining the deliberate lazy-import behavior
so future edits preserve the optional-extra contract; reference the imported
symbols build_langfuse_on_complete_callback and
push_evaluation_results_to_langfuse and the module name langfuse_reporter when
making the change.
README.md (1)

494-496: Consider adding a version note for the langfuse extra.

The runtime module docstring pins to langfuse>=2,<3, but that constraint isn't surfaced here. A one-line note (e.g. "Requires Langfuse SDK v2") would prevent users from hitting the runtime failure described in the langfuse_reporter.py comment if they already have a v3 client installed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 494 - 496, Add a one-line note in the "Optional:
Langfuse" README section stating the required Langfuse SDK version (e.g.,
"Requires Langfuse SDK v2: install with lightspeed-evaluation[langfuse] and
langfuse>=2,<3") so users don't accidentally use v3; reference the runtime
module pin (langfuse>=2,<3) and the build_langfuse_on_complete_callback symbol
in lightspeed_evaluation.integrations.langfuse_reporter to align the doc with
the actual runtime constraint.
tests/unit/integrations/test_langfuse_reporter.py (1)

51-78: Minor: caplog.set_level("ERROR") is set but never asserted on.

Both test_build_callback_invokes_push and test_push_with_mock_langfuse take the caplog fixture and set a level but never inspect caplog.records / caplog.text. Either drop the fixture to keep the tests focused, or add an assertion that no ERROR-level records were emitted on the success path — that would actually protect against regressions in the callback's "catch and log" branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/integrations/test_langfuse_reporter.py` around lines 51 - 78, The
tests test_build_callback_invokes_push and test_push_with_mock_langfuse set
caplog.set_level("ERROR") but never assert against it; either remove the unused
caplog fixture from those test signatures or add an explicit assertion (e.g.,
assert not any(r.levelname == "ERROR" for r in caplog.records) or assert "ERROR"
not in caplog.text) after calling cb(...) /
push_evaluation_results_to_langfuse(...) to ensure no ERROR-level logs were
emitted; update the function signatures and references accordingly
(test_build_callback_invokes_push, test_push_with_mock_langfuse, caplog).
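The suggested "no ERROR logs on the success path" assertion, sketched with a plain stdlib handler so it runs standalone; in the real tests the equivalent check is against `caplog.records` (logger name and message below are illustrative):

```python
import logging


class _Capture(logging.Handler):
    """Collect emitted records, like pytest's caplog does."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("lightspeed_evaluation.integrations")
handler = _Capture()
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

logger.info("pushed scores to Langfuse")  # success path logs below ERROR

# The assertion that actually protects the catch-and-log branch:
assert not any(r.levelno >= logging.ERROR for r in handler.records)
```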
pyproject.toml (1)

55-66: Heads-up: langfuse<3 pins the legacy v2 SDK.

The Langfuse Python SDK v3 (OTEL-based) has been GA since June 2025 and v4 is the current major on PyPI (langfuse==4.5.0). v2 is documented as the legacy SDK and its public API (Langfuse().trace().score()) is not forward-compatible — that shape is used throughout src/lightspeed_evaluation/integrations/langfuse_reporter.py and the tests. Pinning <3.0.0 is fine for now (and your inline comment signals intent), but track this as tech debt: a future v4 bump will require rewriting the reporter to use propagate_attributes / start_observation / score.create.

No action required in this PR unless you want to target v3/v4 directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 55 - 66, The project currently pins langfuse<3
(legacy v2); if you later want to upgrade to the OTEL-based v3/v4 SDK, create a
tech-debt ticket and update pyproject.toml to a v4 pin (e.g.,
"langfuse>=4.0.0,<5.0.0"), then refactor
src/lightspeed_evaluation/integrations/langfuse_reporter.py to replace the v2
API calls (Langfuse().trace().score()) with the v3/v4 APIs (use
propagate_attributes, start_observation and score.create) and update affected
tests to the new API surface; if you prefer to keep v2 for now, add a clear TODO
comment near the langfuse entry in pyproject.toml referencing the ticket so this
migration is tracked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py`:
- Line 25: Replace lint suppressions by restructuring and typing: create a small
dataclass (e.g., LangfuseClientConfig) and accept that or **client_kwargs in
push_evaluation_results_to_langfuse to eliminate the too-many-arguments disable;
perform the optional import at module top-level (or use importlib.find_spec at
import time) and set a _HAS_LANGFUSE flag instead of importing inside the
function to remove import-outside-toplevel; add a TYPE_CHECKING-gated import for
langfuse and annotate the local variable as langfuse: "Langfuse" (or import its
types) to remove the # type: ignore comments; and replace the broad except with
the concrete Langfuse exception type (e.g., langfuse.errors.Error) or
requests.RequestException where appropriate to avoid pylint:
disable=broad-exception-caught.
- Around line 102-110: The current loop always sends a numeric 0.0 when
EvaluationResult.r.score is None, conflating missing/ERROR/SKIPPED with real
zeros; update the loop that builds name/_score_name and comment/_format_comment
so that if r.score is None you do not call trace.score(name=..., value=..., ...)
with a 0.0 — instead either skip calling trace.score for that item and record
the status explicitly (r.result) in the comment/metadata, or call trace with an
explicit non-numeric indication supported by Langfuse (e.g., omit value and add
status to comment/metadata) so ERROR and SKIPPED remain distinguishable from
genuine low numeric scores.
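The None-score handling in the second inline comment can be sketched as follows. `Result` and the `score_fn` callable are illustrative stand-ins, not the real EvaluationResult model or the Langfuse trace API:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    metric: str
    score: Optional[float]
    result: str  # PASS / FAIL / ERROR / SKIPPED


def emit_scores(results: list[Result], score_fn: Callable[[str, float, str], None]):
    """Call score_fn(name, value, comment) only for numeric scores.

    ERROR/SKIPPED rows with score=None are recorded by status instead of
    being sent as a misleading 0.0.
    """
    skipped: list[tuple[str, str]] = []
    for r in results:
        if r.score is None:
            skipped.append((r.metric, r.result))
            continue
        score_fn(r.metric, r.score, f"result={r.result}")
    return skipped
```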

---

Nitpick comments:
In `@pyproject.toml`:
- Around line 55-66: The project currently pins langfuse<3 (legacy v2); if you
later want to upgrade to the OTEL-based v3/v4 SDK, create a tech-debt ticket and
update pyproject.toml to a v4 pin (e.g., "langfuse>=4.0.0,<5.0.0"), then
refactor src/lightspeed_evaluation/integrations/langfuse_reporter.py to replace
the v2 API calls (Langfuse().trace().score()) with the v3/v4 APIs (use
propagate_attributes, start_observation and score.create) and update affected
tests to the new API surface; if you prefer to keep v2 for now, add a clear TODO
comment near the langfuse entry in pyproject.toml referencing the ticket so this
migration is tracked.

In `@README.md`:
- Around line 494-496: Add a one-line note in the "Optional: Langfuse" README
section stating the required Langfuse SDK version (e.g., "Requires Langfuse SDK
v2: install with lightspeed-evaluation[langfuse] and langfuse>=2,<3") so users
don't accidentally use v3; reference the runtime module pin (langfuse>=2,<3) and
the build_langfuse_on_complete_callback symbol in
lightspeed_evaluation.integrations.langfuse_reporter to align the doc with the
actual runtime constraint.

In `@src/lightspeed_evaluation/integrations/__init__.py`:
- Around line 3-11: The module-level import of
build_langfuse_on_complete_callback and push_evaluation_results_to_langfuse
makes importing lightspeed_evaluation.integrations risk raising if langfuse
becomes a hard dependency; update
src/lightspeed_evaluation/integrations/__init__.py to guard the re-exports by
either wrapping the import of langfuse_reporter in a try/except ImportError (and
only add the two symbols to __all__ when import succeeds) or add a clear comment
explaining the deliberate lazy-import behavior so future edits preserve the
optional-extra contract; reference the imported symbols
build_langfuse_on_complete_callback and push_evaluation_results_to_langfuse and
the module name langfuse_reporter when making the change.

In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py`:
- Around line 84-90: The merge order currently allows user-supplied
context.metadata to overwrite reserved keys because trace_meta is populated
first then trace_meta.update(context.metadata or {}); reverse the merge so you
start with user metadata (e.g., initialize trace_meta = dict(context.metadata or
{})), then set/override the reserved keys ("run_name", "result_count", and
optionally "original_data_path") afterwards so the reporter's guaranteed values
(in the trace_meta dict used by the Langfuse reporter) cannot be clobbered by
context.metadata; ensure you still handle context.original_data_path None-check
and preserve existing typing for trace_meta.
- Around line 34-47: Update the docstring to reflect the actual comment format
emitted by _format_comment: change the phrase that says the comment contains
"pass/fail status and a truncated reason" to indicate it emits a
"result=<STATUS>" field (where STATUS is PASS/FAIL/ERROR/SKIPPED) and an
optional truncated reason; mention that PASS/FAIL correspond to pass/fail while
ERROR/SKIPPED are reported as those literal statuses so downstream parsers
should look for result=<STATUS> rather than the words "pass"/"fail".

In `@tests/unit/integrations/test_langfuse_reporter.py`:
- Around line 51-78: The tests test_build_callback_invokes_push and
test_push_with_mock_langfuse set caplog.set_level("ERROR") but never assert
against it; either remove the unused caplog fixture from those test signatures
or add an explicit assertion (e.g., assert not any(r.levelname == "ERROR" for r
in caplog.records) or assert "ERROR" not in caplog.text) after calling cb(...) /
push_evaluation_results_to_langfuse(...) to ensure no ERROR-level logs were
emitted; update the function signatures and references accordingly
(test_build_callback_invokes_push, test_push_with_mock_langfuse, caplog).

In `@tests/unit/runner/test_evaluation.py`:
- Around line 52-64: Add a parametrized unit test that exercises
_make_eval_args(langfuse=True) and verifies the runner wires the Langfuse
callback into evaluate: mock build_langfuse_on_complete_callback to return a
sentinel callback, call evaluate(...) with args from
_make_eval_args(langfuse=True), and assert evaluate was invoked with on_complete
equal to the sentinel (non-None). Locate uses of _make_eval_args, evaluate, and
build_langfuse_on_complete_callback in this test module to add the new case (or
extend the existing parametrized cases around lines where _make_eval_args is
used) so regressions in callback wiring are caught.

In `@tests/unit/test_api.py`:
- Around line 47-49: Add assertions to the existing unit tests to verify that a
non-None on_complete callback is forwarded through the API wrapper into the
pipeline; specifically, in tests that call evaluate, evaluate_conversation, and
evaluate_turn (which currently assert mock_pipeline.run_evaluation.called with
original_data_path=None), add a sentinel (e.g., callback_sentinel) passed as
on_complete to the evaluate* call and assert
mock_pipeline.run_evaluation.assert_called_with(...,
on_complete=callback_sentinel) or the equivalent for run_conversation/run_turn;
update the assertions at the spots mentioned (around lines 47, 116-122, 135-141,
293-300) to include the on_complete=callback_sentinel check so the tests lock in
the pass-through behavior.
🪄 Autofix (Beta)


Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa8520d4-455b-4d02-90d7-970ce2235866

📥 Commits

Reviewing files that changed from the base of the PR and between fd9ee3a and 3c5861c.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
  • README.md
  • pyproject.toml
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/api.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/integrations/__init__.py
  • src/lightspeed_evaluation/integrations/langfuse_reporter.py
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/integrations/test_langfuse_reporter.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/test_api.py

@bsatapat-jpg force-pushed the dev branch 2 times, most recently from cac2b29 to 2a8a136 on April 24, 2026 at 08:08
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/runner/test_evaluation.py (1)

52-64: ⚠️ Potential issue | 🟡 Minor

Missing test coverage for the --langfuse / LIGHTSPEED_USE_LANGFUSE opt-in.

_make_eval_args now takes langfuse=False, and the two assert_called_once_with updates confirm the default path passes on_complete=None. However, there’s no test exercising the enabled path:

  • run_evaluation with langfuse=True should import build_langfuse_on_complete_callback and pass a non-None callback into lightspeed_evaluation.api.evaluate.
  • Same with LIGHTSPEED_USE_LANGFUSE=1 (and the accepted true/yes variants) in the environment.
  • main() should populate argparse.Namespace.langfuse from the new --langfuse flag.

Given this is the only public entry point for the new feature, at minimum a test that patches lightspeed_evaluation.integrations.langfuse_reporter.build_langfuse_on_complete_callback and asserts the returned callable is forwarded would be worthwhile.

Want me to draft these test cases?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/runner/test_evaluation.py` around lines 52 - 64, Tests are missing
for the langfuse opt-in path: add unit tests that call run_evaluation (and main
via invoking the CLI entry point) with langfuse=True and with
LIGHTSPEED_USE_LANGFUSE set (e.g., "1"/"true"/"yes") that patch
lightspeed_evaluation.integrations.langfuse_reporter.build_langfuse_on_complete_callback
to return a sentinel callable and patch lightspeed_evaluation.api.evaluate to
capture its on_complete arg; assert evaluate was called with on_complete is the
sentinel (non-None) when langfuse is enabled and assert _make_eval_args and main
populate argparse.Namespace.langfuse correctly from the --langfuse flag and the
environment variable.
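The env-var opt-in described above hinges on how the accepted truthy spellings are parsed. A minimal sketch, assuming the variable name and the "1"/"true"/"yes" spellings from the review comment (the helper name is illustrative):

```python
import os

# Truthy spellings accepted for the opt-in, per the review comment.
_TRUTHY = {"1", "true", "yes"}


def langfuse_enabled(environ=os.environ) -> bool:
    """Return True when LIGHTSPEED_USE_LANGFUSE is set to a truthy value."""
    return environ.get("LIGHTSPEED_USE_LANGFUSE", "").strip().lower() in _TRUTHY
```

A test for the runner would patch this lookup (or the environment) and assert that a non-None `on_complete` reaches `evaluate(...)`.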
🧹 Nitpick comments (5)
src/lightspeed_evaluation/api.py (1)

41-49: Coding-guideline violation: # pylint: disable=too-many-arguments across public wrappers.

Per repo guidelines for **/*.py, lint suppressions should be replaced by fixing the underlying issue rather than silencing it. All six wrappers (evaluate, evaluate_with_summary, evaluate_conversation, evaluate_conversation_with_summary, evaluate_turn, evaluate_turn_with_summary) now carry this disable. Since the wrappers differ only in their payload shape, a small options object is a clean fix that removes every disable at once.

# In core.models (or alongside EvaluationRunContext)
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True, slots=True)
class EvaluateOptions:
    output_dir: Optional[str] = None
    evaluation_data_path: Optional[str] = None
    on_complete: Optional[Callable[[list[EvaluationResult], EvaluationRunContext], None]] = None
    compute_confidence_intervals: bool = False

Wrappers would then take options: EvaluateOptions = EvaluateOptions() and forward a single object, dropping every # pylint: disable.

As per coding guidelines: "Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead".

Also applies to Lines 86, 129, 162, 199, 245.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/api.py` around lines 41 - 49, Replace the
multiple-argument public wrapper signatures that use "# pylint:
disable=too-many-arguments" (evaluate, evaluate_with_summary,
evaluate_conversation, evaluate_conversation_with_summary, evaluate_turn,
evaluate_turn_with_summary) with a single options object parameter: define an
immutable dataclass EvaluateOptions (with fields output_dir,
evaluation_data_path, on_complete, compute_confidence_intervals, etc.) and
change each wrapper to accept options: EvaluateOptions = EvaluateOptions() and
forward options to the underlying implementation, removing the lint disables;
update all call sites in these wrappers to read options.<continue/>
src/lightspeed_evaluation/integrations/langfuse_reporter.py (1)

139-155: User-supplied context.metadata silently clobbers internal trace keys.

trace_meta.update(context.metadata or {}) is performed after setting run_name, result_count, and original_data_path, so a caller who happens to use any of those keys in context.metadata will overwrite the framework-set values on the Langfuse trace. If that is intentional (callers should be able to override), the code is fine; otherwise consider reversing the precedence or namespacing user metadata.

♻️ Suggested precedence reversal
-    trace_meta: dict[str, Any] = {
-        "run_name": context.run_name,
-        "result_count": len(results),
-    }
-    if context.original_data_path is not None:
-        trace_meta["original_data_path"] = context.original_data_path
-    trace_meta.update(context.metadata or {})
+    trace_meta: dict[str, Any] = dict(context.metadata or {})
+    trace_meta["run_name"] = context.run_name
+    trace_meta["result_count"] = len(results)
+    if context.original_data_path is not None:
+        trace_meta["original_data_path"] = context.original_data_path
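Applied in isolation, the reversed merge behaves like this (the context object here is a simple stand-in, not the project's EvaluationRunContext):

```python
from types import SimpleNamespace
from typing import Any


def build_trace_meta(context, results) -> dict[str, Any]:
    # User metadata goes in first, framework keys last: framework values win.
    trace_meta: dict[str, Any] = dict(context.metadata or {})
    trace_meta["run_name"] = context.run_name
    trace_meta["result_count"] = len(results)
    if context.original_data_path is not None:
        trace_meta["original_data_path"] = context.original_data_path
    return trace_meta


ctx = SimpleNamespace(
    run_name="nightly",
    original_data_path=None,
    metadata={"run_name": "spoofed", "team": "eval"},
)
meta = build_trace_meta(ctx, results=[1, 2])
```

Even though the caller supplied a conflicting `run_name` in metadata, the framework's value survives, while the caller's extra `team` key is kept.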
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py` around lines 139
- 155, The trace metadata merge currently lets user-supplied context.metadata
overwrite framework keys because trace_meta.update(context.metadata or {}) runs
after setting run_name/result_count/original_data_path; change the precedence so
framework keys win by applying user metadata first and then updating/overwriting
with the framework values (i.e., build trace_meta from context.metadata or {}
then set trace_meta["run_name"]=context.run_name,
trace_meta["result_count"]=len(results), and optionally set original_data_path
if present) so that the values used when creating the langfuse.trace (see
trace_meta and trace = langfuse.trace(...)) cannot be clobbered by callers.
tests/unit/pipeline/evaluation/test_pipeline.py (1)

163-260: Good coverage; consider one more edge case.

The two tests cover the happy-path invocation and exception isolation well. Consider a third that calls run_evaluation without original_data_path to lock in that EvaluationRunContext.run_name falls back to "evaluation" and original_data_path is None — this cements the fallback contract that Langfuse traces will rely on.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/pipeline/evaluation/test_pipeline.py` around lines 163 - 260, Add
a new unit test (e.g.,
test_on_complete_defaults_run_name_and_original_data_path_none) that mirrors the
existing on_complete tests but calls EvaluationPipeline.run_evaluation WITHOUT
providing original_data_path; patch the same collaborators (MetricManager,
APIDataAmender, EvaluationErrorHandler, ScriptExecutionManager,
MetricsEvaluator, and ConversationProcessor returning a mock processor that
yields a mock EvaluationResult), register an on_complete callback that captures
the (results, ctx) pair, invoke pipeline.run_evaluation(sample_evaluation_data,
on_complete=on_complete) and assert the callback ran and that ctx.run_name ==
"evaluation" and ctx.original_data_path is None so the fallback behavior is
verified.
src/lightspeed_evaluation/runner/evaluation.py (1)

164-178: Minor: accepted truthy values don’t match common conventions.

("1", "true", "yes") covers the help text’s example but omits on, which some users expect (and is accepted by e.g. distutils.util.strtobool). Consider also handling on / y / t for symmetry, or document the exact accepted set in the --langfuse help.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/runner/evaluation.py` around lines 164 - 178, The
env/CLI truthiness check for enabling Langfuse currently tests only
("1","true","yes"); update the logic around use_langfuse (which uses
eval_args.langfuse and LIGHTSPEED_USE_LANGFUSE) to accept additional common
truthy tokens such as "on", "y", and "t" (or use a standard parser like
distutils.util.strtobool with proper ValueError handling) before calling
build_langfuse_on_complete_callback; ensure the normalized string comparison is
case-insensitive and applied in the same place where use_langfuse is computed so
build_langfuse_on_complete_callback() behavior remains unchanged.
requirements-all-extras.txt (1)

223-224: Consider targeting Langfuse v4.

langfuse==2.60.10 pins to a deprecated v2 release—Langfuse v2 documentation was removed in August 2025, marking its end of life. Langfuse v3 (released June 2025) introduced OpenTelemetry-based observability; however, v4.x (released March 2026) is now current with an observation-centric data model and is the recommended upgrade path. Since this integration is new, target v4 directly (new extra langfuse>=4). If v2 is retained deliberately for compatibility reasons, add a note in the reporter module docstring explaining the rationale and include a tracking issue to migrate.

This is auto-generated from pyproject.toml, so the real change belongs there.
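In pyproject.toml the source-of-truth change might look like this sketch (the extra name `langfuse` is taken from the PR summary; the upper bound is an assumption):

```toml
[project.optional-dependencies]
# Optional Langfuse reporting; the upper bound guards against a future v5 break.
langfuse = ["langfuse>=4,<5"]
```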

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements-all-extras.txt` around lines 223 - 224, The project is pinned to
langfuse==2.60.10 (v2); update the dependency to target the current major
release (langfuse>=4) in the source of truth (pyproject.toml) so the
autogenerated requirements-all-extras.txt will reflect langfuse>=4, or if you
must keep v2 for compatibility, add a clear explanatory note and migration
tracking issue in the reporter module docstring (referencing the reporter module
name) that states why v2 is retained and links the issue to migrate to langfuse
v4.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py`:
- Around line 49-81: The docstring promises Langfuse client errors will be
logged and not re-raised, but only client creation is currently protected; wrap
the trace/score/flush interaction so exceptions are caught and logged instead of
propagated. Specifically, in push_evaluation_results_to_langfuse (or inside
_write_langfuse_trace_and_scores) add a try/except around calls that invoke
client.trace(...), trace.score(...), and client.flush() (or any langfuse API
calls) and on exception call logger.error with contextual info and return
gracefully; keep the client creation logic as-is and do not re-raise the caught
exceptions so the implementation matches the docstring.
- Around line 23-26: The code is calling Langfuse v1-style APIs
(client.trace(...), trace.score(...), client.flush()) which do not exist in
langfuse>=2.0.0; update the implementation in langfuse_reporter.py to use the v2
observation API: replace client.trace(name=..., metadata=..., tags=...) with a
context created via client.start_as_current_observation(name=..., metadata=...,
tags=...) (or the recommended observation factory), switch trace.score(name=...,
value=..., comment=...) to client.create_score(...) or use the returned
observation/span's score method per v2 (e.g., span.score_trace or
client.create_score with the observation id), and remove calls to client.flush()
(v2 has no explicit flush). Also ensure the Langfuse import/optional check still
works and update any variable names (e.g., trace -> observation/span) to match
the v2 objects so tests and runtime use the correct v2 method signatures.
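The log-and-continue wrapping asked for in the first inline comment can be sketched with a stub client; the `trace`/`score`/`flush` names below are taken from the review's own description of the current calls, not verified against any particular langfuse release:

```python
import logging

logger = logging.getLogger("langfuse_reporter_sketch")


def push_results(client, results, run_name: str) -> bool:
    """Log-and-continue: reporting failures must never break an eval run."""
    try:
        trace = client.trace(name=run_name)
        for result in results:
            trace.score(name=result["metric"], value=result["score"])
        client.flush()
        return True
    except Exception as exc:  # broad on purpose: reporting is best-effort
        logger.error("Failed to push results to Langfuse: %s", exc)
        return False


class _ExplodingClient:
    def trace(self, name):
        raise RuntimeError("network down")


ok = push_results(
    _ExplodingClient(), [{"metric": "faithfulness", "score": 0.9}], "run-1"
)
```

With the stub raising mid-call, `push_results` returns False instead of propagating, which is the behavior the docstring promises.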

---

Outside diff comments:
In `@tests/unit/runner/test_evaluation.py`:
- Around line 52-64: Tests are missing for the langfuse opt-in path: add unit
tests that call run_evaluation (and main via invoking the CLI entry point) with
langfuse=True and with LIGHTSPEED_USE_LANGFUSE set (e.g., "1"/"true"/"yes") that
patch
lightspeed_evaluation.integrations.langfuse_reporter.build_langfuse_on_complete_callback
to return a sentinel callable and patch lightspeed_evaluation.api.evaluate to
capture its on_complete arg; assert evaluate was called with on_complete is the
sentinel (non-None) when langfuse is enabled and assert _make_eval_args and main
populate argparse.Namespace.langfuse correctly from the --langfuse flag and the
environment variable.

---

Nitpick comments:
In `@requirements-all-extras.txt`:
- Around line 223-224: The project is pinned to langfuse==2.60.10 (v2); update
the dependency to target the current major release (langfuse>=4) in the source
of truth (pyproject.toml) so the autogenerated requirements-all-extras.txt will
reflect langfuse>=4, or if you must keep v2 for compatibility, add a clear
explanatory note and migration tracking issue in the reporter module docstring
(referencing the reporter module name) that states why v2 is retained and links
the issue to migrate to langfuse v4.

In `@src/lightspeed_evaluation/api.py`:
- Around line 41-49: Replace the multiple-argument public wrapper signatures
that use "# pylint: disable=too-many-arguments" (evaluate,
evaluate_with_summary, evaluate_conversation,
evaluate_conversation_with_summary, evaluate_turn, evaluate_turn_with_summary)
with a single options object parameter: define an immutable dataclass
EvaluateOptions (with fields output_dir, evaluation_data_path, on_complete,
compute_confidence_intervals, etc.) and change each wrapper to accept options:
EvaluateOptions = EvaluateOptions() and forward options to the underlying
implementation, removing the lint disables; update all call sites in these
wrappers to read options.

In `@src/lightspeed_evaluation/integrations/langfuse_reporter.py`:
- Around line 139-155: The trace metadata merge currently lets user-supplied
context.metadata overwrite framework keys because
trace_meta.update(context.metadata or {}) runs after setting
run_name/result_count/original_data_path; change the precedence so framework
keys win by applying user metadata first and then updating/overwriting with the
framework values (i.e., build trace_meta from context.metadata or {} then set
trace_meta["run_name"]=context.run_name,
trace_meta["result_count"]=len(results), and optionally set original_data_path
if present) so that the values used when creating the langfuse.trace (see
trace_meta and trace = langfuse.trace(...)) cannot be clobbered by callers.

In `@src/lightspeed_evaluation/runner/evaluation.py`:
- Around line 164-178: The env/CLI truthiness check for enabling Langfuse
currently tests only ("1","true","yes"); update the logic around use_langfuse
(which uses eval_args.langfuse and LIGHTSPEED_USE_LANGFUSE) to accept additional
common truthy tokens such as "on", "y", and "t" (or use a standard parser like
distutils.util.strtobool with proper ValueError handling) before calling
build_langfuse_on_complete_callback; ensure the normalized string comparison is
case-insensitive and applied in the same place where use_langfuse is computed so
build_langfuse_on_complete_callback() behavior remains unchanged.

In `@tests/unit/pipeline/evaluation/test_pipeline.py`:
- Around line 163-260: Add a new unit test (e.g.,
test_on_complete_defaults_run_name_and_original_data_path_none) that mirrors the
existing on_complete tests but calls EvaluationPipeline.run_evaluation WITHOUT
providing original_data_path; patch the same collaborators (MetricManager,
APIDataAmender, EvaluationErrorHandler, ScriptExecutionManager,
MetricsEvaluator, and ConversationProcessor returning a mock processor that
yields a mock EvaluationResult), register an on_complete callback that captures
the (results, ctx) pair, invoke pipeline.run_evaluation(sample_evaluation_data,
on_complete=on_complete) and assert the callback ran and that ctx.run_name ==
"evaluation" and ctx.original_data_path is None so the fallback behavior is
verified.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 52c593a5-6762-4b3f-8cca-c7df85afe90f

📥 Commits

Reviewing files that changed from the base of the PR and between 3c5861c and 2a8a136.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
  • README.md
  • pyproject.toml
  • requirements-all-extras.txt
  • requirements-local-embeddings.txt
  • requirements-nlp-metrics.txt
  • requirements.txt
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/api.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/integrations/__init__.py
  • src/lightspeed_evaluation/integrations/langfuse_reporter.py
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/integrations/test_langfuse_reporter.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/test_api.py
✅ Files skipped from review due to trivial changes (9)
  • requirements-local-embeddings.txt
  • requirements.txt
  • requirements-nlp-metrics.txt
  • src/lightspeed_evaluation/core/models/evaluation_run_context.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/integrations/__init__.py
  • README.md
  • pyproject.toml
  • tests/unit/integrations/test_langfuse_reporter.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lightspeed_evaluation/__init__.py
  • tests/unit/test_api.py

Comment thread src/lightspeed_evaluation/integrations/langfuse_reporter.py
Comment thread src/lightspeed_evaluation/integrations/langfuse_reporter.py
@bsatapat-jpg
Collaborator Author

@coderabbitai review
