
Support for the Deepeval dataset (LLMTestCase) for LLM tests #401

Merged
AnilSorathiya merged 68 commits into main from anilsorathiya/sc-11452/support-for-the-deepeval-dataset-llmtestcase on Sep 12, 2025

Conversation

@AnilSorathiya (Contributor) commented on Jul 25, 2025

Pull Request Description

This PR introduces support for a new LLM-specific dataset type mapping (DeepEval's LLMTestCase) so that the wide range of LLM tests provided by the DeepEval library can be leveraged.

It also adds new row-level metrics that return arrays. These metrics are used through the assign_scores interface but are not stored in the database; instead, the VM dataset object holds them in memory via assign_scores, making them available to generalized plots and statistical functionality that help document test result interpretations.

What and why?

DeepEval appears to use the output of LangChain/LangGraph workflows as input for its test cases. By supporting this new LLM-specific dataset type mapping, we can leverage the wide range of LLM tests provided by the DeepEval library, which should contribute significantly to comprehensive documentation of LLM use cases.

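For context, here is a minimal sketch of what the DeepEval side of this mapping looks like. The LLMTestCase construction is DeepEval's documented API; the init_dataset call below is only an assumption about how the new mapping might be invoked, not the confirmed interface.

from deepeval.test_case import LLMTestCase
import validmind as vm

# DeepEval test cases, typically built from LangChain/LangGraph workflow outputs.
test_cases = [
    LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris",
    )
]

# Assumed usage of the new dataset mapping; the argument names are placeholders.
vm_dataset = vm.init_dataset(dataset=test_cases, input_id="llm_test_cases")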

How to test

  • Run test_dataset.py unit tests
  • notebooks/code_sharing/deepeval_integration_demo.ipynb
  • notebooks/how_to/assign_scores_complete_tutorial.ipynb

What needs special review?

Dependencies, breaking changes, and deployment notes

Release notes

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • [ ] Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@juanmleng (Contributor) left a comment

Left a couple of suggestions, but nothing blocking. This looks great!

Contributor

With the new assign_scores interface, is it required to provide a predict_fn? This notebook has this:

def agent_fn(input):
    """
    Invoke the simplified agent with the given input.
    """

    return 1.23


vm_model = vm.init_model(
    predict_fn=agent_fn,
    input_id="test_model",
)

Contributor

Found a small issue in this section:

# Initialize ValidMind
print("Integrating with ValidMind framework...")

try:
    # Initialize ValidMind
    vm.init()
    print("ValidMind initialized")

Error:

ERROR: ValidMind integration failed: Model ID must be provided either as an environment variable or as an argument to init.
Note: Some ValidMind features may require additional setup
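
For anyone hitting this, a possible fix sketch is below. The model value is a placeholder, and the exact argument and environment variable names are assumptions based on the error message rather than confirmed details of this notebook.

import os
import validmind as vm

# Option 1: export the model ID before initializing (assumed variable name).
os.environ["VM_API_MODEL"] = "<your-model-id>"

# Option 2: pass the model ID directly to init (assumed argument name).
vm.init(model="<your-model-id>")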

Contributor

Why is the Custom Metrics with G-Eval section not running any tests? If it isn't meant to, we should clarify for the user what that section is trying to demonstrate.

@johnwalz97 (Contributor) left a comment

One nitpick, but this looks good to me and worked locally.

@cachafla (Contributor) commented on Sep 11, 2025

From my chat with Anil: I suggested that we can allow not passing any model to assign_scores while guaranteeing predictable behavior under two conditions. Example:

Assume a test returns:

{
    "outlier_score": float(outlier_scores[i]),
    "isolation_path": float(isolation_paths[i]),
    "anomaly_score": float(anomaly_scores[i]),
    "is_outlier": bool(outlier_predictions[i]),
}

You could call:

assign_scores("OutlierScore")
# dataset
|x|y|OutlierScore|isolation_path|anomaly_score|is_outlier|

If you pass a model, then you get:

model = init_model(input_id="xgb")
assign_scores(model, "OutlierScore")

# dataset
|x|y|OutlierScore|isolation_path|anomaly_score|is_outlier|xgb_OutlierScore|xgb_isolation_path|xgb_anomaly_score|xgb_is_outlier|

But if you call assign_scores without a model again, two things could happen:

assign_scores("OutlierScore")
# ^ this would override the empty model columns?
# two options:
# 1. warn the user that they are overriding the model columns, but still override the values
# 2. raise an error, but allow some flag like override/force if they really want to

Thoughts @juanmleng @AnilSorathiya @johnwalz97?

@cachafla (Contributor):
Ah, the other piece of feedback was to require a full path to the scorer, so something like validmind.llm.scorers.AnswerRelevancy instead of AnswerRelevancy.

@AnilSorathiya (Contributor, Author):

Fixed. The full path (e.g. validmind.llm.scorers.AnswerRelevancy) is now required to run a scorer.

@github-actions (Contributor):

PR Summary

This PR delivers major improvements to the integration between DeepEval and the ValidMind library by refactoring how evaluation metrics are computed and applied to datasets. The key functional changes are:

  1. Row-metric vs. Unit-metric Processing

    • The assign_scores method in the LLMAgentDataset has been updated to support row-level (scorer) metrics. Instead of only computing scalar values that are applied uniformly across the dataset, the new implementation distinguishes between scalar (unit) metrics and row metrics. It now handles outputs that are scalars, lists, or even lists of dictionaries (for complex, per-row evaluations).
  2. Refactoring for Scorer Integration

    • The code now employs a dedicated run_scorer function (instead of the old run_metric) that calls custom scorer functions registered via a new scorer decorator. This decorator auto-generates scorer IDs when not provided and registers functions in a centralized singleton store. Multiple scorer modules and functions (including those for classification and LLM evaluation) have been updated to use the new decorator.
  3. Improved Output Processing and Error Handling

    • Several helper functions (_process_list_scorer_output, _process_dict_list_scorer_output, _process_scalar_scorer_output, etc.) have been added to robustly process different types of scorer outputs and to create dataset columns following a clear naming convention. This also includes better error messages when output lengths don't match the dataset or when dictionary keys are inconsistent (see the illustrative sketch after this summary).
  4. Test Updates and Dependency Adjustments

    • The test suite has been extensively updated to reflect the changes. Tests now verify that row metrics produce varying per-row outputs (as opposed to a single scalar repeated across rows), that column naming conventions are correctly applied both when a model with an input_id is provided and when absent, and that custom scorer outputs (including dictionary outputs) are handled properly.
    • Minor dependency updates (e.g., a bump in the version requirement for tabulate) and adjustments in the API client (to enforce that only scalar metric values are logged) ensure consistency across the system.

Overall, these changes improve the flexibility, robustness, and clarity of model evaluation within the ValidMind ecosystem, enabling users to leverage DeepEval’s comprehensive metrics in a structured and production-ready manner.
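
The sketch below is illustrative only (plain Python, not the library's actual code). It shows how scorer outputs of different shapes, scalar, list, or list of dictionaries, could be mapped to dataset columns under the naming convention and error handling described in items 1 and 3 above; the helper name and prefix convention are assumptions.

import pandas as pd

def add_scorer_columns(df, scorer_name, output, prefix=""):
    """Add columns for a scorer output that is a scalar, a list, or a list of dicts."""
    if isinstance(output, list) and output and isinstance(output[0], dict):
        keys = output[0].keys()
        if any(row.keys() != keys for row in output):
            raise ValueError("Inconsistent dictionary keys across scorer output rows")
        for key in keys:
            df[f"{prefix}{key}"] = [row[key] for row in output]
    elif isinstance(output, list):
        if len(output) != len(df):
            raise ValueError("Scorer output length does not match the dataset")
        df[f"{prefix}{scorer_name}"] = output
    else:
        # Scalar (unit) metric: broadcast a single value across all rows.
        df[f"{prefix}{scorer_name}"] = output
    return df

df = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})
add_scorer_columns(
    df,
    "OutlierScore",
    [{"outlier_score": 0.9, "is_outlier": True},
     {"outlier_score": 0.1, "is_outlier": False},
     {"outlier_score": 0.2, "is_outlier": False}],
    prefix="xgb_",  # mimics a model input_id prefix such as "xgb_outlier_score"
)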

Test Suggestions

  • Run all unit tests for assign_scores to verify columns are created properly both when a model with an input_id is provided and when it is omitted.
  • Test custom scorer functions by returning various output types (scalars, lists, list of dictionaries) to ensure the processing functions (_process_list_scorer_output, etc.) correctly unpack and add the proper columns.
  • Perform integration testing on a real LLM evaluation scenario to confirm that DeepEval metrics (such as AnswerRelevancy) are computed and assigned as expected.
  • Verify backward compatibility by checking that legacy tests relying on unit_metrics (now replaced by scorer-based metrics) continue to produce valid and comparable results.
  • Introduce tests that intentionally supply mismatched scorer output lengths or inconsistent dictionary keys to ensure that proper error messages are raised.
  • Check the API client logging for metrics to guarantee that non-scalar values are rejected or properly handled in accordance with the updated validations.

@AnilSorathiya (Contributor, Author):

(Quoting @cachafla's comment above about calling assign_scores without a model and the two options for handling existing score columns.)

I have started throwing warnings for now until we decide.
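
For illustration only, a minimal sketch of that interim behavior; this is not the actual library code, and the helper name and message wording are hypothetical.

import warnings

def _warn_on_overwrite(df, new_columns):
    # Warn, then overwrite, when assign_scores targets columns that already exist.
    existing = [c for c in new_columns if c in df.columns]
    if existing:
        warnings.warn(
            f"assign_scores is overwriting existing columns: {existing}. "
            "Pass a model with a distinct input_id to keep both sets of scores."
        )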

@cachafla (Contributor) left a comment

Awesome 👏

@AnilSorathiya merged commit e883b6d into main on Sep 12, 2025
17 checks passed
@AnilSorathiya deleted the anilsorathiya/sc-11452/support-for-the-deepeval-dataset-llmtestcase branch on September 12, 2025 at 11:52

Labels

  • documentation: Improvements or additions to documentation
  • enhancement: New feature or request
  • highlight: Feature to be curated in the release notes
