Merged
86 commits
1b3f67a
support agent use case
AnilSorathiya Jun 24, 2025
723fcab
wrapper function for agent
AnilSorathiya Jun 24, 2025
28d9fbb
ragas metrics
AnilSorathiya Jun 30, 2025
ecf8e09
update ragas metrics
AnilSorathiya Jun 30, 2025
53e8879
fix lint error
AnilSorathiya Jun 30, 2025
1662368
create helper functions
AnilSorathiya Jul 1, 2025
cc84cbc
Merge branch 'main' into anilsorathiya/sc-10863/add-support-for-llm-a…
AnilSorathiya Jul 2, 2025
6f09780
delete old notebook
AnilSorathiya Jul 2, 2025
0bb731e
update description for each section
AnilSorathiya Jul 2, 2025
e758979
simplify agent
AnilSorathiya Jul 9, 2025
7c35cfe
simple demo notebook using langchain agent
AnilSorathiya Jul 10, 2025
9bb70e9
Update description of the simplified langgraph agent demo notebook
AnilSorathiya Jul 10, 2025
894d52a
add brief description to tests
AnilSorathiya Jul 14, 2025
d86a9af
add brief description to tests
AnilSorathiya Jul 14, 2025
884000f
Allow dict return type for predict_fn
AnilSorathiya Jul 17, 2025
fbd5aa9
update notebook and refactor utils
AnilSorathiya Jul 18, 2025
daceabf
lint fix
AnilSorathiya Jul 18, 2025
5f8823a
Merge branch 'main' into anilsorathiya/sc-11324/extend-the-predict-fn…
AnilSorathiya Jul 18, 2025
70a5636
fix the test failure
AnilSorathiya Jul 18, 2025
33b06fb
new unit tests for multiple columns return in assign_predictions
AnilSorathiya Jul 18, 2025
8e12bd2
update notebooks to return multiple values in predict_fn
AnilSorathiya Jul 18, 2025
e38929d
general plotting and stats tests
AnilSorathiya Jul 23, 2025
e900a65
clear output
AnilSorathiya Jul 23, 2025
a08e881
Merge branch 'main' into anilsorathiya/sc-11380/add-generlize-plots-a…
AnilSorathiya Jul 24, 2025
16f4700
remove duplicate tests
AnilSorathiya Jul 24, 2025
bb9f9af
update notebook
AnilSorathiya Jul 24, 2025
5078a7a
Integration between deepeval and validmind
AnilSorathiya Jul 25, 2025
2eb6abb
Merge branch 'main' into anilsorathiya/sc-11452/support-for-the-deepe…
AnilSorathiya Aug 12, 2025
ad0b719
add MetricValues class for metric return type
AnilSorathiya Aug 15, 2025
94ca006
Return MetricValues in the unit tests
AnilSorathiya Aug 15, 2025
c4c885a
update all the unit metric tests
AnilSorathiya Aug 15, 2025
a1f3220
add unit tests for MetricValues class
AnilSorathiya Aug 15, 2025
1a7d0b6
update result to support MetricValues for unit metric tests
AnilSorathiya Aug 15, 2025
1d785ba
add copyright statement
AnilSorathiya Aug 15, 2025
271e85b
add deepeval lib as an extra dependency
AnilSorathiya Aug 15, 2025
f806fc6
fix the error
AnilSorathiya Aug 15, 2025
61c7ef6
demo draft change
AnilSorathiya Aug 18, 2025
b646d0b
demo draft change
AnilSorathiya Aug 18, 2025
dda4ced
fix api issue
AnilSorathiya Aug 18, 2025
dd8e0df
Merge branch 'main' into anilsorathiya/sc-11452/support-for-the-deepe…
AnilSorathiya Aug 21, 2025
81249c2
separate unit metrics and row metrics
AnilSorathiya Aug 22, 2025
794a322
draft notebook
AnilSorathiya Aug 22, 2025
a27bc48
Merge branch 'main' into anilsorathiya/sc-11452/support-for-the-deepe…
AnilSorathiya Aug 22, 2025
84dfa2f
update assign_score notebook
AnilSorathiya Aug 22, 2025
7aa2acc
update assign score notebook
AnilSorathiya Sep 1, 2025
247eacc
rename notebook
AnilSorathiya Sep 1, 2025
394c57c
update deepeval and VM integration notebook
AnilSorathiya Sep 1, 2025
a2ca13c
Merge branch 'main' into anilsorathiya/sc-11452/support-for-the-deepe…
AnilSorathiya Sep 4, 2025
5ebe51f
rename row metrics to scorer
AnilSorathiya Sep 4, 2025
15df53b
add scorer decorator
AnilSorathiya Sep 4, 2025
e28ba37
remove UnitMetricValue and RowMetricValues as they are not needed any…
AnilSorathiya Sep 4, 2025
d8a48c8
remove MetricValue class
AnilSorathiya Sep 5, 2025
d425576
support complex output for scorer
AnilSorathiya Sep 5, 2025
9c7e7e9
remove simple testcases
AnilSorathiya Sep 9, 2025
bbd6cd4
fix the list_scorers
AnilSorathiya Sep 9, 2025
c7b83f3
update notebook
AnilSorathiya Sep 9, 2025
a33f2a4
remove circular dependency of load_test
AnilSorathiya Sep 9, 2025
30c3abc
remove circular dependency of load_test
AnilSorathiya Sep 9, 2025
e91e6e4
move the AnswerRelevancy scorer into the deepeval namespace
AnilSorathiya Sep 9, 2025
a284cd1
unit metric can return int and float only
AnilSorathiya Sep 9, 2025
1ec1c75
update notebook
AnilSorathiya Sep 9, 2025
427ddf5
fix lint error
AnilSorathiya Sep 9, 2025
917831c
remove scores listing from list_tests interface
AnilSorathiya Sep 10, 2025
58b3bde
add custom scorer support
AnilSorathiya Sep 10, 2025
cb52104
full path required to run scorer
AnilSorathiya Sep 11, 2025
36f2f96
remove circular dependency
AnilSorathiya Sep 11, 2025
439bd1d
make model parameter optional in the assign_scores function
AnilSorathiya Sep 11, 2025
66dde16
fix lint error
AnilSorathiya Sep 11, 2025
b0fe22e
add tests
AnilSorathiya Sep 15, 2025
1fe452d
update notebook
AnilSorathiya Sep 15, 2025
5a101ff
Merge branch 'main' into anilsorathiya/sc-12254/add-new-deepeval-test…
AnilSorathiya Sep 15, 2025
730032c
Merge branch 'main' into anilsorathiya/sc-12254/add-new-deepeval-test…
AnilSorathiya Oct 1, 2025
dc2c743
add deepeval metrics as scorer
AnilSorathiya Oct 2, 2025
7b7a363
add copyright
AnilSorathiya Oct 2, 2025
472a16e
remove Geval test
AnilSorathiya Oct 7, 2025
b2d9a2a
add task completion test
AnilSorathiya Oct 7, 2025
b4c311f
update demo notebook
AnilSorathiya Oct 7, 2025
db63fe4
gitignore *.deepeval
AnilSorathiya Oct 7, 2025
8b43a77
update boxplot
AnilSorathiya Oct 7, 2025
d6c22df
update deepeval integration notebook
AnilSorathiya Oct 7, 2025
1b7cc74
Merge branch 'main' into anilsorathiya/sc-12254/add-new-deepeval-test…
AnilSorathiya Oct 13, 2025
e3755bf
remove all tag from validmind lib installation
AnilSorathiya Oct 14, 2025
a4d7de8
update notebooks
AnilSorathiya Oct 15, 2025
9437e51
update notebook
AnilSorathiya Oct 17, 2025
142950f
Merge branch 'main' into anilsorathiya/sc-12254/add-new-deepeval-test…
AnilSorathiya Oct 17, 2025
38e331d
2.10.1
AnilSorathiya Oct 17, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -226,3 +226,6 @@ my_tests/
# Quarto docs
docs/validmind.json
*.html
*.qmd
# DeepEval
*.deepeval/
2 changes: 1 addition & 1 deletion notebooks/code_samples/agents/banking_test_dataset.py
@@ -12,7 +12,7 @@
"category": "credit_risk"
},
{
"input": "Evaluate credit risk for a business loan of $250,000 with monthly revenue of $85,000 and existing debt of $45,000",
"input": "Evaluate credit risk for a business loan of $250,000 with monthly revenue of $85,000 and existing debt of $45,000 and credit score of 650",
"expected_tools": ["credit_risk_analyzer"],
"possible_outputs": ["MEDIUM RISK", "HIGH RISK", "business loan", "debt service coverage ratio", "1.8", "annual revenue", "$1,020,000", "risk score", "650"],
"session_id": str(uuid.uuid4()),
@@ -117,7 +117,7 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install -q \"validmind[all]\" langgraph"
"%pip install -q validmind langgraph"
]
},
{
@@ -202,7 +202,6 @@
"from banking_tools import AVAILABLE_TOOLS\n",
"from validmind.tests import run_test\n",
"\n",
"\n",
"# Load environment variables if using .env file\n",
"try:\n",
" from dotenv import load_dotenv\n",
@@ -316,8 +315,7 @@
"except Exception as e:\n",
" print(f\"Fraud Detection System test FAILED: {e}\")\n",
"\n",
"print(\"\" + \"=\" * 60)\n",
"\n"
"print(\"\" + \"=\" * 60)"
]
},
{
@@ -478,8 +476,21 @@
" tool_message = \"\"\n",
" for output in captured_data[\"tool_outputs\"]:\n",
" tool_message += output['content']\n",
" \n",
" tool_calls_found = []\n",
" messages = result['messages']\n",
" for message in messages:\n",
" if hasattr(message, 'tool_calls') and message.tool_calls:\n",
" for tool_call in message.tool_calls:\n",
" # Handle both dictionary and object formats\n",
" if isinstance(tool_call, dict):\n",
" tool_calls_found.append(tool_call['name'])\n",
" else:\n",
" # ToolCall object - use attribute access\n",
" tool_calls_found.append(tool_call.name)\n",
"\n",
"\n",
" return {\"prediction\": result['messages'][-1].content, \"output\": result, \"tool_messages\": [tool_message]}\n",
" return {\"prediction\": result['messages'][-1].content, \"output\": result, \"tool_messages\": [tool_message], \"tool_calls\": tool_calls_found}\n",
" except Exception as e:\n",
" # Return a fallback response if the agent fails\n",
" error_message = f\"\"\"I apologize, but I encountered an error while processing your banking request: {str(e)}.\n",
@@ -597,7 +608,7 @@
"source": [
"## Banking Test Dataset\n",
"\n",
"We'll use our comprehensive banking test dataset to evaluate our agent's performance across different banking scenarios.\n",
"We'll use a sample test dataset to evaluate our agent's performance across different banking scenarios.\n",
"\n",
"### Initialize ValidMind Dataset\n",
"\n",
@@ -625,6 +636,15 @@
"print(f\"Dataset columns: {vm_test_dataset._df.columns}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vm_test_dataset._df.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -795,7 +815,75 @@
" \"agent_output_column\": \"banking_agent_model_output\",\n",
" \"expected_tools_column\": \"expected_tools\"\n",
" }\n",
")"
").log()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scorers in ValidMind\n",
"\n",
"Scorers are evaluation metrics that analyze model outputs and store their results in the dataset. When using `assign_scores()`:\n",
"\n",
"- Each scorer adds a new column to the dataset with format: {scorer_name}_{metric_name}\n",
"- The column contains the numeric score (typically 0-1) for each example\n",
"- Multiple scorers can be run on the same dataset, each adding their own column\n",
"- Scores are persisted in the dataset for later analysis and visualization\n",
"- Common scorer patterns include:\n",
" - Model performance metrics (accuracy, F1, etc)\n",
" - Output quality metrics (relevance, faithfulness)\n",
" - Task-specific metrics (completion, correctness)"
]
},
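To make the multi-scorer behaviour described above concrete, here is a minimal sketch that runs a second scorer on the same dataset. The `AnswerRelevancy` scorer path and its column parameters are assumptions inferred from this PR's commits ("move the AnswerRelevancy scorer into the deepeval namespace"), not a verified API surface.

```python
# Minimal sketch, assuming an AnswerRelevancy scorer exists in the
# deepeval namespace (per this PR's commits) and accepts the same
# column-style kwargs as the TaskCompletion call shown below.
vm_test_dataset.assign_scores(
    metrics="validmind.scorer.llm.deepeval.AnswerRelevancy",  # assumed path
    input_column="input",
    actual_output_column="banking_agent_model_prediction",
)

# Each scorer appends its own {scorer_name}_{metric_name} column, so the
# dataset accumulates score columns side by side (e.g. "TaskCompletion_score").
print([c for c in vm_test_dataset._df.columns if c.endswith("_score")])
```

Because every scorer writes to its own column, scores from different scorers persist in the dataset and can be compared or plotted later without re-running the agent.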
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task Completion scorer\n",
"\n",
"The TaskCompletion test evaluates whether our banking agent successfully completes the requested tasks by analyzing its outputs and tool usage. This metric assesses the agent's ability to understand user requests, execute appropriate actions, and provide complete responses that address the original query. The test provides a score between 0-1 along with detailed feedback on task completion quality."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vm_test_dataset.assign_scores(\n",
" metrics = \"validmind.scorer.llm.deepeval.TaskCompletion\",\n",
" input_column=\"input\",\n",
" tools_called_column=\"tools_called\",\n",
" actual_output_column=\"banking_agent_model_prediction\",\n",
" agent_output_column=\"banking_agent_model_output\"\n",
" )\n",
"vm_test_dataset._df.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The TaskCompletion scorer has added a new column 'TaskCompletion_score' to our dataset. This is because when we run scorers through assign_scores(), the return values are automatically processed and added as new columns with the format {scorer_name}_{metric_name}. We'll use this column to visualize the distribution of task completion scores across our test cases. Let's visualize the distribution through the box plot test."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_test(\n",
" \"validmind.plots.BoxPlot\",\n",
" inputs={\"dataset\": vm_test_dataset},\n",
" params={\n",
" \"columns\": \"TaskCompletion_score\",\n",
" \"title\": \"Distribution of Task Completion Scores\",\n",
" \"ylabel\": \"Score\",\n",
" \"figsize\": (8, 6)\n",
" }\n",
").log()\n"
]
},
{