Merged
86 commits
1b3f67a
support agent use case
AnilSorathiya Jun 24, 2025
723fcab
wrapper function for agent
AnilSorathiya Jun 24, 2025
28d9fbb
ragas metrics
AnilSorathiya Jun 30, 2025
ecf8e09
update ragas metrics
AnilSorathiya Jun 30, 2025
53e8879
fix lint error
AnilSorathiya Jun 30, 2025
1662368
create helper functions
AnilSorathiya Jul 1, 2025
cc84cbc
Merge branch 'main' into anilsorathiya/sc-10863/add-support-for-llm-a…
AnilSorathiya Jul 2, 2025
6f09780
delete old notebook
AnilSorathiya Jul 2, 2025
0bb731e
update description for each section
AnilSorathiya Jul 2, 2025
e758979
simplify agent
AnilSorathiya Jul 9, 2025
7c35cfe
simple demo notebook using langchain agent
AnilSorathiya Jul 10, 2025
9bb70e9
Update description of the simplified langgraph agent demo notebook
AnilSorathiya Jul 10, 2025
894d52a
add brief description to tests
AnilSorathiya Jul 14, 2025
d86a9af
add brief description to tests
AnilSorathiya Jul 14, 2025
884000f
Allow dict return type for predict_fn
AnilSorathiya Jul 17, 2025
fbd5aa9
update notebook and refactor utils
AnilSorathiya Jul 18, 2025
daceabf
lint fix
AnilSorathiya Jul 18, 2025
5f8823a
Merge branch 'main' into anilsorathiya/sc-11324/extend-the-predict-fn…
AnilSorathiya Jul 18, 2025
70a5636
fix the test failure
AnilSorathiya Jul 18, 2025
33b06fb
new unit tests for multiple columns return in assign_predictions
AnilSorathiya Jul 18, 2025
8e12bd2
update notebooks to return multiple values in predict_fn
AnilSorathiya Jul 18, 2025
e38929d
general plotting and stats tests
AnilSorathiya Jul 23, 2025
e900a65
clear output
AnilSorathiya Jul 23, 2025
a08e881
Merge branch 'main' into anilsorathiya/sc-11380/add-generlize-plots-a…
AnilSorathiya Jul 24, 2025
16f4700
remove duplicate tests
AnilSorathiya Jul 24, 2025
bb9f9af
update notebook
AnilSorathiya Jul 24, 2025
5078a7a
Integration between deepeval and validmind
AnilSorathiya Jul 25, 2025
2eb6abb
Merge branch 'main' into anilsorathiya/sc-11452/support-for-the-deepe…
AnilSorathiya Aug 12, 2025
ad0b719
add MetricValues class for metric return type
AnilSorathiya Aug 15, 2025
94ca006
Return MetricValues in the unit tests
AnilSorathiya Aug 15, 2025
c4c885a
update all the unit metric tests
AnilSorathiya Aug 15, 2025
a1f3220
add unit tests for MetricValues class
AnilSorathiya Aug 15, 2025
1a7d0b6
update result to support MetricValues for unit metric tests
AnilSorathiya Aug 15, 2025
1d785ba
add copyright statement
AnilSorathiya Aug 15, 2025
271e85b
add deepeval lib as an extra dependency
AnilSorathiya Aug 15, 2025
f806fc6
fix the error
AnilSorathiya Aug 15, 2025
61c7ef6
demo draft change
AnilSorathiya Aug 18, 2025
b646d0b
demo draft change
AnilSorathiya Aug 18, 2025
dda4ced
fix api issue
AnilSorathiya Aug 18, 2025
dd8e0df
Merge branch 'main' into anilsorathiya/sc-11452/support-for-the-deepe…
AnilSorathiya Aug 21, 2025
81249c2
separate unit metrics and row metrics
AnilSorathiya Aug 22, 2025
794a322
draft notebook
AnilSorathiya Aug 22, 2025
a27bc48
Merge branch 'main' into anilsorathiya/sc-11452/support-for-the-deepe…
AnilSorathiya Aug 22, 2025
84dfa2f
update assign_score notebook
AnilSorathiya Aug 22, 2025
7aa2acc
update assign score notebook
AnilSorathiya Sep 1, 2025
247eacc
rename notebook
AnilSorathiya Sep 1, 2025
394c57c
update deepeval and VM integration notebook
AnilSorathiya Sep 1, 2025
a2ca13c
Merge branch 'main' into anilsorathiya/sc-11452/support-for-the-deepe…
AnilSorathiya Sep 4, 2025
5ebe51f
rename row metrics to scorer
AnilSorathiya Sep 4, 2025
15df53b
add scorer decorator
AnilSorathiya Sep 4, 2025
e28ba37
remove UnitMetricValue and RowMetricValues as they are not needed any…
AnilSorathiya Sep 4, 2025
d8a48c8
remove MetricValue class
AnilSorathiya Sep 5, 2025
d425576
support complex output for scorer
AnilSorathiya Sep 5, 2025
9c7e7e9
remove simple testcases
AnilSorathiya Sep 9, 2025
bbd6cd4
fix the list_scorers
AnilSorathiya Sep 9, 2025
c7b83f3
update notebook
AnilSorathiya Sep 9, 2025
a33f2a4
remove circular dependency of load_test
AnilSorathiya Sep 9, 2025
30c3abc
remove circular dependency of load_test
AnilSorathiya Sep 9, 2025
e91e6e4
move the AnswerRelevancy scorer into the deepeval namespace
AnilSorathiya Sep 9, 2025
a284cd1
unit metric can return int and float only
AnilSorathiya Sep 9, 2025
1ec1c75
update notebook
AnilSorathiya Sep 9, 2025
427ddf5
fix lint error
AnilSorathiya Sep 9, 2025
917831c
remove scores listing from list_tests interface
AnilSorathiya Sep 10, 2025
58b3bde
add custom scorer support
AnilSorathiya Sep 10, 2025
cb52104
full path required to run scorer
AnilSorathiya Sep 11, 2025
36f2f96
remove circular dependency
AnilSorathiya Sep 11, 2025
439bd1d
make model parameter optional in the assign_scores function
AnilSorathiya Sep 11, 2025
66dde16
fix lint error
AnilSorathiya Sep 11, 2025
b0fe22e
add tests
AnilSorathiya Sep 15, 2025
1fe452d
update notebook
AnilSorathiya Sep 15, 2025
5a101ff
Merge branch 'main' into anilsorathiya/sc-12254/add-new-deepeval-test…
AnilSorathiya Sep 15, 2025
730032c
Merge branch 'main' into anilsorathiya/sc-12254/add-new-deepeval-test…
AnilSorathiya Oct 1, 2025
dc2c743
add deepeval metrics as scorer
AnilSorathiya Oct 2, 2025
7b7a363
add copyright
AnilSorathiya Oct 2, 2025
472a16e
remove Geval test
AnilSorathiya Oct 7, 2025
b2d9a2a
add task completion test
AnilSorathiya Oct 7, 2025
b4c311f
update demo notebook
AnilSorathiya Oct 7, 2025
db63fe4
gitignore *.deepeval
AnilSorathiya Oct 7, 2025
8b43a77
update boxplot
AnilSorathiya Oct 7, 2025
d6c22df
update deepeval integration notebook
AnilSorathiya Oct 7, 2025
1b7cc74
Merge branch 'main' into anilsorathiya/sc-12254/add-new-deepeval-test…
AnilSorathiya Oct 13, 2025
e3755bf
remove all tag from validmind lib installation
AnilSorathiya Oct 14, 2025
a4d7de8
update notebooks
AnilSorathiya Oct 15, 2025
9437e51
update notebook
AnilSorathiya Oct 17, 2025
142950f
Merge branch 'main' into anilsorathiya/sc-12254/add-new-deepeval-test…
AnilSorathiya Oct 17, 2025
38e331d
2.10.1
AnilSorathiya Oct 17, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -226,3 +226,6 @@ my_tests/
# Quarto docs
docs/validmind.json
*.html
*.qmd
# DeepEval
*.deepeval/
2 changes: 1 addition & 1 deletion notebooks/code_samples/agents/banking_test_dataset.py
@@ -12,7 +12,7 @@
"category": "credit_risk"
},
{
"input": "Evaluate credit risk for a business loan of $250,000 with monthly revenue of $85,000 and existing debt of $45,000",
"input": "Evaluate credit risk for a business loan of $250,000 with monthly revenue of $85,000 and existing debt of $45,000 and credit score of 650",
"expected_tools": ["credit_risk_analyzer"],
"possible_outputs": ["MEDIUM RISK", "HIGH RISK", "business loan", "debt service coverage ratio", "1.8", "annual revenue", "$1,020,000", "risk score", "650"],
"session_id": str(uuid.uuid4()),
@@ -117,7 +117,7 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install -q \"validmind[all]\" langgraph"
"%pip install -q validmind langgraph"
]
},
{
@@ -202,7 +202,6 @@
"from banking_tools import AVAILABLE_TOOLS\n",
"from validmind.tests import run_test\n",
"\n",
"\n",
"# Load environment variables if using .env file\n",
"try:\n",
" from dotenv import load_dotenv\n",
@@ -316,8 +315,7 @@
"except Exception as e:\n",
" print(f\"Fraud Detection System test FAILED: {e}\")\n",
"\n",
"print(\"\" + \"=\" * 60)\n",
"\n"
"print(\"\" + \"=\" * 60)"
]
},
{
@@ -478,8 +476,21 @@
" tool_message = \"\"\n",
" for output in captured_data[\"tool_outputs\"]:\n",
" tool_message += output['content']\n",
" \n",
" tool_calls_found = []\n",
" messages = result['messages']\n",
" for message in messages:\n",
" if hasattr(message, 'tool_calls') and message.tool_calls:\n",
" for tool_call in message.tool_calls:\n",
" # Handle both dictionary and object formats\n",
" if isinstance(tool_call, dict):\n",
" tool_calls_found.append(tool_call['name'])\n",
" else:\n",
" # ToolCall object - use attribute access\n",
" tool_calls_found.append(tool_call.name)\n",
"\n",
"\n",
" return {\"prediction\": result['messages'][-1].content, \"output\": result, \"tool_messages\": [tool_message]}\n",
" return {\"prediction\": result['messages'][-1].content, \"output\": result, \"tool_messages\": [tool_message], \"tool_calls\": tool_calls_found}\n",
" except Exception as e:\n",
" # Return a fallback response if the agent fails\n",
" error_message = f\"\"\"I apologize, but I encountered an error while processing your banking request: {str(e)}.\n",
@@ -597,7 +608,7 @@
"source": [
"## Banking Test Dataset\n",
"\n",
"We'll use our comprehensive banking test dataset to evaluate our agent's performance across different banking scenarios.\n",
"We'll use a sample test dataset to evaluate our agent's performance across different banking scenarios.\n",
"\n",
"### Initialize ValidMind Dataset\n",
"\n",
@@ -625,6 +636,15 @@
"print(f\"Dataset columns: {vm_test_dataset._df.columns}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vm_test_dataset._df.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -795,7 +815,75 @@
" \"agent_output_column\": \"banking_agent_model_output\",\n",
" \"expected_tools_column\": \"expected_tools\"\n",
" }\n",
")"
").log()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scorers in ValidMind\n",
"\n",
"Scorers are evaluation metrics that analyze model outputs and store their results in the dataset. When using `assign_scores()`:\n",
"\n",
"- Each scorer adds a new column to the dataset with format: {scorer_name}_{metric_name}\n",
"- The column contains the numeric score (typically 0-1) for each example\n",
"- Multiple scorers can be run on the same dataset, each adding their own column\n",
"- Scores are persisted in the dataset for later analysis and visualization\n",
"- Common scorer patterns include:\n",
" - Model performance metrics (accuracy, F1, etc)\n",
" - Output quality metrics (relevance, faithfulness)\n",
" - Task-specific metrics (completion, correctness)"
]
},
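To make the multi-scorer behaviour described above concrete, here is a minimal sketch that runs a second scorer on the same dataset. The `AnswerRelevancy` scorer path and its column parameters are assumptions inferred from this PR's commits ("move the AnswerRelevancy scorer into the deepeval namespace"), not a verified API surface.

```python
# Minimal sketch, assuming an AnswerRelevancy scorer exists in the
# deepeval namespace (per this PR's commits) and accepts the same
# column-style kwargs as the TaskCompletion call shown below.
vm_test_dataset.assign_scores(
    metrics="validmind.scorer.llm.deepeval.AnswerRelevancy",  # assumed path
    input_column="input",
    actual_output_column="banking_agent_model_prediction",
)

# Each scorer appends its own {scorer_name}_{metric_name} column, so the
# dataset accumulates score columns side by side (e.g. "TaskCompletion_score").
print([c for c in vm_test_dataset._df.columns if c.endswith("_score")])
```

Because every scorer writes to its own column, scores from different scorers persist in the dataset and can be compared or plotted later without re-running the agent.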
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task Completion scorer\n",
"\n",
"The TaskCompletion test evaluates whether our banking agent successfully completes the requested tasks by analyzing its outputs and tool usage. This metric assesses the agent's ability to understand user requests, execute appropriate actions, and provide complete responses that address the original query. The test provides a score between 0-1 along with detailed feedback on task completion quality."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vm_test_dataset.assign_scores(\n",
" metrics = \"validmind.scorer.llm.deepeval.TaskCompletion\",\n",
" input_column=\"input\",\n",
" tools_called_column=\"tools_called\",\n",
" actual_output_column=\"banking_agent_model_prediction\",\n",
" agent_output_column=\"banking_agent_model_output\"\n",
" )\n",
"vm_test_dataset._df.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The TaskCompletion scorer has added a new column 'TaskCompletion_score' to our dataset. This is because when we run scorers through assign_scores(), the return values are automatically processed and added as new columns with the format {scorer_name}_{metric_name}. We'll use this column to visualize the distribution of task completion scores across our test cases. Let's visualize the distribution through the box plot test."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_test(\n",
" \"validmind.plots.BoxPlot\",\n",
" inputs={\"dataset\": vm_test_dataset},\n",
" params={\n",
" \"columns\": \"TaskCompletion_score\",\n",
" \"title\": \"Distribution of Task Completion Scores\",\n",
" \"ylabel\": \"Score\",\n",
" \"figsize\": (8, 6)\n",
" }\n",
").log()\n"
]
},
{