[SC 12254] Add new deepeval tests in lib #426
Conversation
Some feedback for …:

- Instead of …, ….
- For …, ….
- For …, ….

For …: does the …? There should be a short explanation clarifying how the test knows where the input and output columns are declared, so the end user knows how the input dataset is being used. Alternatively, we can also pass … explicitly. I tend to think that this could apply to all uses of scorers in demo notebooks, actually 🤔.

Towards the end of the notebook, in the section …, this code is not needed because ….

At the end of the notebook there's a cell with this text: …. The last cell runs each of the …. How does that integrate with VM? Via scorers or tests? As a user I wouldn't know how to bring those results from ….
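One way to make the column mapping explicit (a minimal sketch, not the notebook's actual code: the dataframe, the column names, and the `build_test_cases` helper are all hypothetical) is to map explicitly named input/output columns onto DeepEval `LLMTestCase` objects:

```python
import pandas as pd
from deepeval.test_case import LLMTestCase

# Hypothetical evaluation dataframe; column names are illustrative only.
df = pd.DataFrame(
    {
        "question": ["What is the interest rate on a savings account?"],
        "model_answer": ["The current savings rate is 4.1% APY."],
        "reference_answer": ["Savings accounts earn 4.1% APY."],
    }
)

def build_test_cases(frame, input_col, output_col, expected_col=None):
    """Map explicitly named dataframe columns onto DeepEval test cases."""
    cases = []
    for _, row in frame.iterrows():
        cases.append(
            LLMTestCase(
                input=row[input_col],
                actual_output=row[output_col],
                expected_output=row[expected_col] if expected_col else None,
            )
        )
    return cases

test_cases = build_test_cases(df, "question", "model_answer", "reference_answer")
```

Making the column names explicit arguments, rather than relying on an implicit convention, is essentially the suggestion above.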
cachafla
left a comment
Looking good 🙌 Apologies for the delay in reviewing this.
My suggestion would be to put the golden datasets + GEval in separate notebooks. This notebook has everything we need, but it feels heavy and tries to do multiple things at the same time.
Specifically, I'd recommend:
- Leaving this notebook as a demonstration of integration with DeepEval `LLMTestCase` and `SummarizationMetric`
- Another notebook that demonstrates how to use `LLMAgentDataset` with the `Golden` dataset from DeepEval
- Another notebook that demonstrates how to use `GEval` with VM scorers and/or VM tests
Specifically for Golden, I feel we should define the actual use case we want to demonstrate. GEval has a clearer objective here, but the Golden examples with the mock LLM usage feel a bit out of place.
To expedite merging this PR, we can probably update the notebook to drop the golden datasets + GEval and come back to them in a follow-up PR.
Thoughts?
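For reference, the core flow the first notebook would demonstrate is small (a minimal sketch assuming DeepEval's documented `LLMTestCase` and `SummarizationMetric` API; the source text and summary are placeholders, and the metric needs an LLM judge configured, e.g. an OpenAI key):

```python
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

# Placeholder source text and model-generated summary.
source_text = "ValidMind integrates with DeepEval to evaluate LLM outputs ..."
summary = "ValidMind can score LLM outputs with DeepEval metrics."

# The summarization metric compares the summary (actual_output) against the
# original text (input) using an LLM judge.
test_case = LLMTestCase(input=source_text, actual_output=summary)

metric = SummarizationMetric(threshold=0.5)
metric.measure(test_case)

print(metric.score, metric.reason)
```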
These comments look very sensible to me. Totally agree!
This is a known issue in …: these tests ignore the `verbose_mode` setting.
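For context, `verbose_mode` is normally passed when the metric is constructed (a minimal sketch assuming DeepEval metrics accept the flag in their constructor; the threshold value is arbitrary):

```python
from deepeval.metrics import SummarizationMetric

# verbose_mode=True asks the metric to print its intermediate evaluation steps;
# the known issue above is that these tests do not honor the flag.
metric = SummarizationMetric(threshold=0.5, verbose_mode=True)
```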
Added a separate section in the notebook.
This section has been removed from the notebook. I will create a separate notebook in PR #434.
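For the follow-up notebook, a minimal `GEval` sketch (assuming DeepEval's documented `GEval` API; the criteria string and test case contents are placeholders, and how the score would be logged back to VM, via a scorer or a test, is the open question raised above):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# GEval evaluates a test case against free-form criteria using an LLM judge.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="What fees apply to international transfers?",   # placeholder
    actual_output="A flat $15 fee applies.",                # placeholder
    expected_output="International transfers cost $15.",    # placeholder
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```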
Yes, agree.
juanmleng
left a comment
LGTM! Great notebook! Just left a small comment.
PR Summary

This PR introduces significant enhancements and bug fixes to improve the integration between ValidMind and DeepEval. Key functional changes include:

- …

Overall, these changes aim to streamline the testing infrastructure for LLMs, improve metric evaluations, and provide detailed insights into agent behavior through advanced scoring metrics from DeepEval.

Test Suggestions

- …
Pull Request Description
What and why?
- … `deepeval` namespace, enabling evaluation of LLM outputs with standardized metrics (sketched below).
- … `.gitignore` for `*.deepeval` artifacts; remove deprecated/duplicate tests (e.g., GEval); improve plots (e.g., boxplot) and examples.
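To illustrate how a test under the new namespace might be invoked (a hedged sketch: `validmind.tests.run_test` and `vm.init_dataset` are the library's generic entry points, but the test ID, input keys, and column names below are hypothetical placeholders, not confirmed by this PR):

```python
import pandas as pd
import validmind as vm
from validmind.tests import run_test

# Hypothetical evaluation data; column names are illustrative only.
df = pd.DataFrame(
    {
        "input": ["Summarize the refund policy."],
        "output": ["Refunds are issued within 14 days."],
    }
)

# Register the dataframe with ValidMind so tests can consume it.
vm_dataset = vm.init_dataset(dataset=df, input_id="deepeval_demo_dataset")

# Hypothetical test ID -- the actual IDs added under the deepeval
# namespace in this PR may differ.
result = run_test(
    "validmind.model_validation.deepeval.SummarizationQuality",
    inputs={"dataset": vm_dataset},
)
result.log()
```

The actual test IDs and required inputs should be taken from the demo notebook referenced below.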
How to test

- Run `notebooks/code_sharing/deepeval_integration_demo.ipynb` end-to-end; verify DeepEval scorers run, log results, and produce expected figures/tables.
- Run `notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb` with `validmind/datasets/llm/agent_dataset.py` to exercise Task Completion and related scorers.
- `pytest -q tests/test_scorer_decorator.py`
- `pytest -q tests/test_unit_tests.py`, plus any added tests tagged for DeepEval/scorers.

What needs special review?
Dependencies, breaking changes, and deployment notes
Release notes
Checklist