Support for the Deepeval dataset (LLMTestCase) for LLM tests#401
Conversation
juanmleng
left a comment
Left a couple of suggestions, but nothing blocking. This looks great!
With the new `assign_scores` interface, is it required to provide a `predict_fn`? This notebook has this:
```python
def agent_fn(input):
    """
    Invoke the simplified agent with the given input.
    """
    return 1.23

vm_model = vm.init_model(
    predict_fn=agent_fn,
    input_id="test_model",
)
```
Found a small issue in this section:
```python
# Initialize ValidMind
print("Integrating with ValidMind framework...")
try:
    # Initialize ValidMind
    vm.init()
    print("ValidMind initialized")
```

Error:
ERROR: ValidMind integration failed: Model ID must be provided either as an environment variable or as an argument to init.
Note: Some ValidMind features may require additional setup
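The fallback the error message describes can be sketched as follows (a minimal sketch; the name `resolve_model_id` and the env var `VM_API_MODEL` are illustrative assumptions, not ValidMind internals):

```python
import os

def resolve_model_id(model=None, env_var="VM_API_MODEL"):
    """Resolve the model ID from an explicit argument, falling back to
    an environment variable; raise if neither is set (assumed behavior)."""
    model_id = model or os.environ.get(env_var)
    if model_id is None:
        raise ValueError(
            "Model ID must be provided either as an environment "
            "variable or as an argument to init."
        )
    return model_id
```

So the notebook cell should either pass the model ID to `vm.init()` or set the corresponding environment variable before calling it.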
Why is the Custom Metrics with G-Eval section not running any tests? If that is intentional, we should clarify for the user what we are trying to demonstrate in that section.
johnwalz97
left a comment
One nitpick but looks good to me and worked locally
From my chat with Anil: I suggested that we can allow not passing any model to `assign_scores`. Assume a test returns:

```python
{
    "outlier_score": float(outlier_scores[i]),
    "isolation_path": float(isolation_paths[i]),
    "anomaly_score": float(anomaly_scores[i]),
    "is_outlier": bool(outlier_predictions[i]),
}
```

You could call:

```python
assign_scores("OutlierScore")
# dataset
# |x|y|OutlierScore|isolation_path|anomaly_score|is_outlier|
```

If you pass a model, then you get:

```python
model = init_model(input_id="xgb")
assign_scores(model, "OutlierScore")
# dataset
# |x|y|OutlierScore|isolation_path|anomaly_score|is_outlier|xgb_OutlierScore|xgb_isolation_path|xgb_anomaly_score|xgb_is_outlier|
```

But if you then call:

```python
assign_scores("OutlierScore")
# ^ this would override the empty model columns?
# two options:
# 1. warn the user that they are overriding the model columns, but still override the values
# 2. raise an error, but allow some flag like override/force if they really want to
```

Thoughts @juanmleng @AnilSorathiya @johnwalz97?
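The two options could be combined roughly like this (a sketch only; `assign_score_columns` and its signature are assumptions for discussion, not the actual `assign_scores` implementation):

```python
import warnings

def assign_score_columns(columns, scores, model_prefix=None, force=False):
    """Sketch of the override policy: refuse to overwrite an existing
    score column unless force=True, in which case warn and overwrite.

    `columns` is a plain dict mapping column name -> values, standing in
    for the dataset's in-memory score columns.
    """
    for name, values in scores.items():
        # Without a model, write bare column names; with one, prefix them.
        col = f"{model_prefix}_{name}" if model_prefix else name
        if col in columns:
            if not force:
                raise ValueError(
                    f"Column '{col}' already exists; pass force=True to override it"
                )
            warnings.warn(f"Overriding existing score column '{col}'")
        columns[col] = values
    return columns
```

Under this sketch, model-less and model-prefixed calls write disjoint column names, so only a repeat call with the same (or no) prefix would trigger the warning or error.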
Ah, the other piece of feedback was to require a full path to the scorer, so something like…
PR Summary

This PR delivers major improvements to the integration between DeepEval and the ValidMind library by refactoring how evaluation metrics are computed and applied to datasets. The key functional changes are:

Overall, these changes improve the flexibility, robustness, and clarity of model evaluation within the ValidMind ecosystem, enabling users to leverage DeepEval's comprehensive metrics in a structured and production-ready manner.

Test Suggestions
I have started throwing warnings for now until we decide.
Pull Request Description
This PR introduces support for a new LLM-specific dataset type mapping (LLMTestCase) to leverage the wide range of LLM tests provided by the DeepEval library.
This PR also covers the new row-level metrics that return arrays. These metrics will be used in the `assign_scores` interface but will not be stored in the database. Instead, the VM dataset object can hold them in memory through `assign_scores`. This will allow their use in generalized plots and statistical functionality, helping to document test result interpretations.

What and why?
It appears that Deepeval utilizes the output of LangChain/LangGraph workflows as input for its test cases. Once we introduce support for this new LLM-specific dataset type mapping, we will be able to leverage a wide range of LLM tests provided by the Deepeval library. This could significantly contribute to the comprehensive documentation of LLM use cases.
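As a rough sketch of that mapping (the row keys `prompt`, `response`, and `reference` are assumptions for illustration; DeepEval's `LLMTestCase` accepts fields such as `input`, `actual_output`, and `expected_output`):

```python
def rows_to_test_cases(rows):
    """Map workflow output rows to DeepEval-style test-case kwargs.

    Each returned dict could be unpacked into an LLMTestCase, e.g.
    LLMTestCase(**kwargs); the source column names are hypothetical.
    """
    return [
        {
            "input": row["prompt"],
            "actual_output": row["response"],
            "expected_output": row.get("reference"),
        }
        for row in rows
    ]
```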
How to test
What needs special review?
Dependencies, breaking changes, and deployment notes
Release notes
Checklist