
Support for the Deepeval dataset (LLMTestCase) for LLM tests #401

Merged
AnilSorathiya merged 68 commits into main from anilsorathiya/sc-11452/support-for-the-deepeval-dataset-llmtestcase on Sep 12, 2025

Conversation

@AnilSorathiya (Contributor) commented on Jul 25, 2025

Pull Request Description

This PR introduces support for a new LLM-specific dataset type mapping (DeepEval's LLMTestCase) so that the wide range of LLM tests provided by the DeepEval library can be leveraged.

It also adds new row-level metrics that return arrays. These metrics are used through the assign_scores interface but are not stored in the database; instead, the VM dataset object holds them in memory via assign_scores, making them available to generalized plots and statistical functionality that help document test result interpretations.

What and why?

DeepEval appears to use the output of LangChain/LangGraph workflows as input for its test cases. By supporting this new LLM-specific dataset type mapping, we can leverage the wide range of LLM tests provided by the DeepEval library, which should contribute significantly to comprehensive documentation of LLM use cases.

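For context, here is a minimal sketch of what the DeepEval side of this mapping looks like. The LLMTestCase construction is DeepEval's documented API; the init_dataset call below is only an assumption about how the new mapping might be invoked, not the confirmed interface.

from deepeval.test_case import LLMTestCase
import validmind as vm

# DeepEval test cases, typically built from LangChain/LangGraph workflow outputs.
test_cases = [
    LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris",
    )
]

# Assumed usage of the new dataset mapping; the argument names are placeholders.
vm_dataset = vm.init_dataset(dataset=test_cases, input_id="llm_test_cases")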

How to test

  • Run test_dataset.py unit tests
  • notebooks/code_sharing/deepeval_integration_demo.ipynb
  • notebooks/how_to/assign_scores_complete_tutorial.ipynb

What needs special review?

Dependencies, breaking changes, and deployment notes

Release notes

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • [ ] Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@juanmleng (Contributor) left a comment

Left a couple of suggestions, but nothing blocking. This looks great!

Contributor

With the new assign_scores interface, is it required to provide a predict_fn? This notebook has this:

def agent_fn(input):
    """
    Invoke the simplified agent with the given input.
    """

    return 1.23


vm_model = vm.init_model(
    predict_fn=agent_fn,
    input_id="test_model",
)

Contributor

Found a small issue in this section:

# Initialize ValidMind
print("Integrating with ValidMind framework...")

try:
    # Initialize ValidMind
    vm.init()
    print("ValidMind initialized")

Error:

ERROR: ValidMind integration failed: Model ID must be provided either as an environment variable or as an argument to init.
Note: Some ValidMind features may require additional setup
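
For anyone hitting this, a possible fix sketch is below. The model value is a placeholder, and the exact argument and environment variable names are assumptions based on the error message rather than confirmed details of this notebook.

import os
import validmind as vm

# Option 1: export the model ID before initializing (assumed variable name).
os.environ["VM_API_MODEL"] = "<your-model-id>"

# Option 2: pass the model ID directly to init (assumed argument name).
vm.init(model="<your-model-id>")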

Contributor

Why is the Custom Metrics with G-Eval section not running any tests? If it isn't meant to, we should clarify for the user what that section is trying to demonstrate.

@johnwalz97 (Contributor) left a comment

One nitpick, but this looks good to me and worked locally.

@cachafla (Contributor) commented on Sep 11, 2025

From my chat with Anil: I suggested that we can allow not passing any model to assign_scores while guaranteeing predictable behavior under two conditions. Example:

Assume a test returns:

{
    "outlier_score": float(outlier_scores[i]),
    "isolation_path": float(isolation_paths[i]),
    "anomaly_score": float(anomaly_scores[i]),
    "is_outlier": bool(outlier_predictions[i]),
}

You could call:

assign_scores("OutlierScore")
# dataset
|x|y|OutlierScore|isolation_path|anomaly_score|is_outlier|

If you pass a model, then you get:

model = init_model(input_id="xgb")
assign_scores(model, "OutlierScore")

# dataset
|x|y|OutlierScore|isolation_path|anomaly_score|is_outlier|xgb_OutlierScore|xgb_isolation_path|xgb_anomaly_score|xgb_is_outlier|

But if you call assign_scores without a model again, two things could happen:

assign_scores("OutlierScore")
# ^ this would override the empty model columns?
# two options:
# 1. warn the user that they are overriding the model columns, but still override the values
# 2. raise an error, but allow some flag like override/force if they really want to

Thoughts @juanmleng @AnilSorathiya @johnwalz97?

@cachafla (Contributor):
Ah, the other piece of feedback was to require a full path to the scorer, so something like validmind.llm.scorers.AnswerRelevancy instead of AnswerRelevancy.

@AnilSorathiya (Contributor, Author):

Fixed. The full path (e.g. validmind.llm.scorers.AnswerRelevancy) is now required to run a scorer.

@github-actions (Contributor):

PR Summary

This PR delivers major improvements to the integration between DeepEval and the ValidMind library by refactoring how evaluation metrics are computed and applied to datasets. The key functional changes are:

  1. Row-metric vs. Unit-metric Processing

    • The assign_scores method in the LLMAgentDataset has been updated to support row-level (scorer) metrics. Instead of only computing scalar values that are applied uniformly across the dataset, the new implementation distinguishes between scalar (unit) metrics and row metrics. It now handles outputs that are scalars, lists, or even lists of dictionaries (for complex, per-row evaluations).
  2. Refactoring for Scorer Integration

    • The code now employs a dedicated run_scorer function (instead of the old run_metric) that calls custom scorer functions registered via a new scorer decorator. This decorator auto-generates scorer IDs when not provided and registers functions in a centralized singleton store. Multiple scorer modules and functions (including those for classification and LLM evaluation) have been updated to use the new decorator.
  3. Improved Output Processing and Error Handling

    • Several helper functions (_process_list_scorer_output, _process_dict_list_scorer_output, _process_scalar_scorer_output, etc.) have been added to robustly process different types of scorer outputs and to create dataset columns following a clear naming convention. This also includes better error messages when output lengths don't match the dataset or when dictionary keys are inconsistent (see the illustrative sketch after this summary).
  4. Test Updates and Dependency Adjustments

    • The test suite has been extensively updated to reflect the changes. Tests now verify that row metrics produce varying per-row outputs (as opposed to a single scalar repeated across rows), that column naming conventions are correctly applied both when a model with an input_id is provided and when absent, and that custom scorer outputs (including dictionary outputs) are handled properly.
    • Minor dependency updates (e.g., a bump in the version requirement for tabulate) and adjustments in the API client (to enforce that only scalar metric values are logged) ensure consistency across the system.

Overall, these changes improve the flexibility, robustness, and clarity of model evaluation within the ValidMind ecosystem, enabling users to leverage DeepEval’s comprehensive metrics in a structured and production-ready manner.
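
The sketch below is illustrative only (plain Python, not the library's actual code). It shows how scorer outputs of different shapes, scalar, list, or list of dictionaries, could be mapped to dataset columns under the naming convention and error handling described in items 1 and 3 above; the helper name and prefix convention are assumptions.

import pandas as pd

def add_scorer_columns(df, scorer_name, output, prefix=""):
    """Add columns for a scorer output that is a scalar, a list, or a list of dicts."""
    if isinstance(output, list) and output and isinstance(output[0], dict):
        keys = output[0].keys()
        if any(row.keys() != keys for row in output):
            raise ValueError("Inconsistent dictionary keys across scorer output rows")
        for key in keys:
            df[f"{prefix}{key}"] = [row[key] for row in output]
    elif isinstance(output, list):
        if len(output) != len(df):
            raise ValueError("Scorer output length does not match the dataset")
        df[f"{prefix}{scorer_name}"] = output
    else:
        # Scalar (unit) metric: broadcast a single value across all rows.
        df[f"{prefix}{scorer_name}"] = output
    return df

df = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})
add_scorer_columns(
    df,
    "OutlierScore",
    [{"outlier_score": 0.9, "is_outlier": True},
     {"outlier_score": 0.1, "is_outlier": False},
     {"outlier_score": 0.2, "is_outlier": False}],
    prefix="xgb_",  # mimics a model input_id prefix such as "xgb_outlier_score"
)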

Test Suggestions

  • Run all unit tests for assign_scores to verify columns are created properly both when a model with an input_id is provided and when it is omitted.
  • Test custom scorer functions by returning various output types (scalars, lists, list of dictionaries) to ensure the processing functions (_process_list_scorer_output, etc.) correctly unpack and add the proper columns.
  • Perform integration testing on a real LLM evaluation scenario to confirm that DeepEval metrics (such as AnswerRelevancy) are computed and assigned as expected.
  • Verify backward compatibility by checking that legacy tests relying on unit_metrics (now replaced by scorer-based metrics) continue to produce valid and comparable results.
  • Introduce tests that intentionally supply mismatched scorer output lengths or inconsistent dictionary keys to ensure that proper error messages are raised.
  • Check the API client logging for metrics to guarantee that non-scalar values are rejected or properly handled in accordance with the updated validations.

@AnilSorathiya (Contributor, Author):

(Quoting @cachafla's comment above about calling assign_scores without a model and the two options for handling existing score columns.)

I have started throwing warnings for now until we decide.
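
For illustration only, a minimal sketch of that interim behavior; this is not the actual library code, and the helper name and message wording are hypothetical.

import warnings

def _warn_on_overwrite(df, new_columns):
    # Warn, then overwrite, when assign_scores targets columns that already exist.
    existing = [c for c in new_columns if c in df.columns]
    if existing:
        warnings.warn(
            f"assign_scores is overwriting existing columns: {existing}. "
            "Pass a model with a distinct input_id to keep both sets of scores."
        )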

@cachafla (Contributor) left a comment

Awesome 👏

@AnilSorathiya merged commit e883b6d into main on Sep 12, 2025
17 checks passed
@AnilSorathiya deleted the anilsorathiya/sc-11452/support-for-the-deepeval-dataset-llmtestcase branch on September 12, 2025 at 11:52

Labels

  • documentation: Improvements or additions to documentation
  • enhancement: New feature or request
  • highlight: Feature to be curated in the release notes
