Merged
164 changes: 159 additions & 5 deletions notebooks/how_to/understand_utilize_rawdata.ipynb
@@ -31,7 +31,7 @@
" - [Pearson Correlation Matrix](#toc2_2_) \n",
" - [Precision-Recall Curve](#toc2_3_) \n",
" - [Using `RawData` in custom tests](#toc2_4_) \n",
"\n",
" - [Using `RawData` in comparison tests](#toc2_5_) \n",
":::\n",
"<!-- jn-toc-notebook-config\n",
"\tnumbering=false\n",
@@ -213,7 +213,8 @@
" - [Using `RawData` from the ROC Curve Test](#toc2_1_) \n",
" - [Pearson Correlation Matrix](#toc2_2_) \n",
" - [Precision-Recall Curve](#toc2_3_) \n",
" - [Using `RawData` in custom tests](#toc2_4_) "
" - [Using `RawData` in custom tests](#toc2_4_) \n",
" - [Using `RawData` in comparison tests](#toc2_5_) "
]
},
{
@@ -553,17 +554,170 @@
" generate_description=False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "53084493",
"metadata": {},
"source": [
"<a id='toc2_5_'></a>\n",
"\n",
"### Using `RawData` in comparison tests\n",
"\n",
"When running comparison tests, the `RawData` object will contain the raw data for each individual test result as well as the comparison results between the test results. To support this, the RawData object contains the model and dataset input_ids for each of the datasets and models in the test, so that the post-processing function can use them to customize the output. The example below shows how to use the `RawData` object to customize the output of a comparison test and add a table to the test result that shows the confusion matrix for each individual test result as well as the comparison results between the test results.\n",
"\n",
"When designing post-processing functions that need to handle both individual and comparison test results, you can check the structure of the raw data to determine which case you're dealing with. In the example below, we check if `confusion_matrix` is a list (comparison test with multiple matrices) or a single matrix (individual test). For comparison tests, the function creates two tables: one showing the confusion matrices for each test case, and another showing the percentage drift between them. For individual tests, it creates a single table with the confusion matrix values. This pattern of checking the raw data structure can be applied to other tests to create versatile post-processing functions that work in both scenarios.\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "bcbbe9f4",
"metadata": {},
"outputs": [],
"source": [
"def cm_table(result: TestResult):\n",
" # For individual results\n",
" if not isinstance(result.raw_data.confusion_matrix, list):\n",
" # Extract values from single confusion matrix\n",
" cm = result.raw_data.confusion_matrix\n",
" tn, fp = cm[0, 0], cm[0, 1]\n",
" fn, tp = cm[1, 0], cm[1, 1]\n",
" \n",
" # Create DataFrame for individual matrix\n",
" cm_df = pd.DataFrame({\n",
" 'TN': [tn],\n",
" 'FP': [fp],\n",
" 'FN': [fn],\n",
" 'TP': [tp]\n",
" })\n",
" \n",
" # Add individual table\n",
" result.add_table(cm_df, title=\"Confusion Matrix\")\n",
" \n",
" # For comparison results\n",
" else:\n",
" cms = result.raw_data.confusion_matrix\n",
" cm1, cm2 = cms[0], cms[1]\n",
" \n",
" # Create individual results table\n",
" rows = []\n",
" for i, cm in enumerate(cms):\n",
" rows.append({\n",
" 'dataset': result.raw_data.dataset[i],\n",
" 'model': result.raw_data.model[i],\n",
" 'TN': cm[0, 0],\n",
" 'FP': cm[0, 1],\n",
" 'FN': cm[1, 0],\n",
" 'TP': cm[1, 1]\n",
" })\n",
" individual_df = pd.DataFrame(rows)\n",
" \n",
" # Calculate percentage differences\n",
" diff_df = pd.DataFrame({\n",
" 'TN_drift (%)': [(cm2[0, 0] - cm1[0, 0]) / cm1[0, 0] * 100],\n",
" 'FP_drift (%)': [(cm2[0, 1] - cm1[0, 1]) / cm1[0, 1] * 100],\n",
" 'FN_drift (%)': [(cm2[1, 0] - cm1[1, 0]) / cm1[1, 0] * 100],\n",
" 'TP_drift (%)': [(cm2[1, 1] - cm1[1, 1]) / cm1[1, 1] * 100]\n",
" }).round(2)\n",
" \n",
" # Add both tables\n",
" result.add_table(individual_df, title=\"Individual Confusion Matrices\")\n",
" result.add_table(diff_df, title=\"Confusion Matrix Drift\")\n",
" \n",
" return result"
]
},
{
"cell_type": "markdown",
"id": "41edd959",
"metadata": {},
"source": [
"Let's first run the confusion matrix test on a single dataset-model pair to see how our post-processing function handles individual results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cf3c47fe",
"metadata": {},
"outputs": [],
"source": [
"from validmind.tests import run_test\n",
"\n",
"result_cm = run_test(\n",
" \"validmind.model_validation.sklearn.ConfusionMatrix\",\n",
" inputs={\n",
" \"dataset\": vm_test_ds,\n",
" \"model\": vm_model,\n",
" },\n",
" post_process_fn=cm_table,\n",
" generate_description=False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "a2482c54",
"metadata": {},
"source": [
"Now let's run a comparison test between test and train datasets to see how the function handles multiple results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a1b4388",
"metadata": {},
"outputs": [],
"source": [
"result_cm = run_test(\n",
" \"validmind.model_validation.sklearn.ConfusionMatrix\",\n",
" input_grid={\n",
" \"dataset\": [vm_test_ds, vm_train_ds],\n",
" \"model\": [vm_model]\n",
" },\n",
" post_process_fn=cm_table,\n",
" generate_description=False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9f7d361a",
"metadata": {},
"source": [
"Let's inspect the raw data to see how comparison tests structure their data - notice how the `RawData` object contains not just the confusion matrices for both datasets, but also tracks which dataset and model each result came from:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "012ec495",
"metadata": {},
"outputs": [],
"source": [
"result_cm.raw_data.inspect()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "ValidMind Library",
"language": "python",
"name": "python3"
"name": "validmind"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"version": "3.10.13"
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
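The `cm_table` function added above is specific to the confusion matrix, but the same structure check generalizes to other tests. A minimal sketch of the generic pattern, assuming a hypothetical test that stores a single `scores` entry plus the `dataset` and `model` input IDs in its `RawData` (the name `scores` is illustrative, not part of the library):

```python
import pandas as pd


def scores_table(result):
    """Post-processing sketch that handles individual and comparison runs.

    Assumes the test stored `scores` (a hypothetical raw-data entry) along
    with the `dataset` and `model` input IDs in its RawData.
    """
    scores = result.raw_data.scores

    if isinstance(scores, list):
        # Comparison run: one entry per dataset/model pair
        rows = [
            {
                "dataset": result.raw_data.dataset[i],
                "model": result.raw_data.model[i],
                "score": score,
            }
            for i, score in enumerate(scores)
        ]
        result.add_table(pd.DataFrame(rows), title="Per-input Scores")
    else:
        # Individual run: a single entry
        result.add_table(pd.DataFrame({"score": [scores]}), title="Score")

    return result
```

As with `cm_table` in the notebook, it would be passed to `run_test` via `post_process_fn`.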
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -10,7 +10,7 @@ description = "ValidMind Library"
license = "Commercial License"
name = "validmind"
readme = "README.pypi.md"
version = "2.8.11"
version = "2.8.12"

[tool.poetry.dependencies]
aiohttp = {extras = ["speedups"], version = "*"}
9 changes: 7 additions & 2 deletions scripts/bulk_ai_test_updates.py
@@ -118,7 +118,7 @@ def list_to_str(lst):

**Purpose**:

The Feature Drift test aims to evaluate how much the distribution of features has shifted over time between two datasets, typically training and monitoring datasets. It uses the Population Stability Index (PSI) to quantify this change, providing insights into the models robustness and the necessity for retraining or feature engineering.
The Feature Drift test aims to evaluate how much the distribution of features has shifted over time between two datasets, typically training and monitoring datasets. It uses the Population Stability Index (PSI) to quantify this change, providing insights into the model's robustness and the necessity for retraining or feature engineering.

**Test Mechanism**:

@@ -181,6 +181,11 @@ def list_to_str(lst):
It's a class that can be initialized with any number of objects of any type, using a key-value-like interface where the key in the constructor is the name of the object and the value is the object itself.
It should only be used to store data that is not already returned as part of the test result (i.e. in a table) but could be useful to re-generate any of the test result objects (tables, figures).

When adding raw data, you should always include:
- If the test has access to a model parameter (VMModel), include its input_id as model=model.input_id
- If the test has access to a dataset parameter (VMDataset), include its input_id as dataset=dataset.input_id
Only include these if they are available in the test function parameters - don't force both if only one is accessible.

You will be provided with the source code for a "test" that is run against an ML model or dataset.
You will analyze the code to determine the details and implementation of the test.
Then you will use the below example to implement changes to the test to make it use the new raw data mechanism offered by the ValidMind SDK.
@@ -228,7 +233,7 @@ def ExampleConfusionMatrix(model: VMModel, dataset: VMDataset):
fig = ff.create_annotated_heatmap()
..

return fig, RawData(confusion_matrix=cm)
return fig, RawData(confusion_matrix=cm, model=model.input_id, dataset=dataset.input_id)
```

Notice that the test now returns a tuple of the figure and the raw data.
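The `ExampleConfusionMatrix` above shows the case where both a model and a dataset are available. A minimal sketch of the dataset-only case described by the new guidance (only the dataset's `input_id` is attached). The test name, the `describe()` summary, the `dataset.df` accessor, and the import paths are illustrative assumptions rather than part of the prompt:

```python
from validmind import RawData
from validmind.vm_models import VMDataset


def ExampleSummaryStats(dataset: VMDataset):
    # Dataset-only test: there is no model parameter, so only the dataset
    # input_id is stored alongside the raw values.
    summary = dataset.df.describe()
    return summary, RawData(summary=summary, dataset=dataset.input_id)
```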
7 changes: 4 additions & 3 deletions tests/unit_tests/data_validation/test_IQROutliersTable.py
@@ -38,10 +38,11 @@ def setUp(self):
)

def test_outliers_structure(self):
result = IQROutliersTable(self.vm_dataset)
result, raw_data = IQROutliersTable(self.vm_dataset)

# Check basic structure
self.assertIsInstance(result, dict)
self.assertIsInstance(raw_data, vm.RawData)
self.assertIn("Summary of Outliers Detected by IQR Method", result)

# Check result structure
@@ -59,7 +60,7 @@
self.assertIn("Maximum Outlier Value", summary)

def test_outliers_detection(self):
result = IQROutliersTable(self.vm_dataset)
result, raw_data = IQROutliersTable(self.vm_dataset)
outliers_summary = result["Summary of Outliers Detected by IQR Method"]

# Check that outliers are detected in the 'with_outliers' column
@@ -76,7 +77,7 @@
self.assertIsNone(normal_summary)

def test_binary_exclusion(self):
result = IQROutliersTable(self.vm_dataset)
result, raw_data = IQROutliersTable(self.vm_dataset)
outliers_summary = result["Summary of Outliers Detected by IQR Method"]

# Verify binary column is not in results
30 changes: 12 additions & 18 deletions tests/unit_tests/data_validation/test_IsolationForestOutliers.py
@@ -28,25 +28,19 @@ def setUp(self):
)

def test_outliers_detection(self):
result = IsolationForestOutliers(self.vm_dataset, contamination=0.1)
figure, raw_data = IsolationForestOutliers(self.vm_dataset, contamination=0.1)

# Check return type
self.assertIsInstance(result, tuple)
# Check return types
self.assertIsInstance(figure, plt.Figure)
self.assertIsInstance(raw_data, vm.RawData)

# Separate figures and raw data
figures = result

# Check that at least one figure is returned
self.assertGreater(len(figures), 0)

# Check each figure
for fig in figures:
self.assertIsInstance(fig, plt.Figure)
# Check that the figure has at least one axes
self.assertGreater(len(figure.axes), 0)

def test_feature_columns_validation(self):
# Test with valid feature columns
try:
IsolationForestOutliers(
figure, raw_data = IsolationForestOutliers(
self.vm_dataset, feature_columns=["feature1", "feature2"]
)
except ValueError:
@@ -60,13 +54,13 @@

def test_contamination_parameter(self):
# Test with different contamination levels
figures_low_contamination = IsolationForestOutliers(
figure_low, raw_data_low = IsolationForestOutliers(
self.vm_dataset, contamination=0.05
)
figures_high_contamination = IsolationForestOutliers(
figure_high, raw_data_high = IsolationForestOutliers(
self.vm_dataset, contamination=0.2
)

# Check that figures are returned for both contamination levels
self.assertGreater(len(figures_low_contamination), 0)
self.assertGreater(len(figures_high_contamination), 0)
# Check that figures have at least one axes
self.assertGreater(len(figure_low.axes), 0)
self.assertGreater(len(figure_high.axes), 0)
5 changes: 4 additions & 1 deletion tests/unit_tests/data_validation/test_JarqueBera.py
@@ -29,11 +29,14 @@ def test_returns_dataframe_and_rawdata(self):
)

# Run the function
result = JarqueBera(vm_dataset)
result, raw_data = JarqueBera(vm_dataset)

# Check if result is a DataFrame
self.assertIsInstance(result, pd.DataFrame)

# Check if raw_data is a RawData object
self.assertIsInstance(raw_data, vm.RawData)

# Check if the DataFrame has the expected columns
expected_columns = ["column", "stat", "pvalue", "skew", "kurtosis"]
self.assertListEqual(list(result.columns), expected_columns)
5 changes: 4 additions & 1 deletion tests/unit_tests/data_validation/test_LJungBox.py
@@ -22,11 +22,14 @@ def test_returns_dataframe_with_expected_shape(self):
)

# Run the function
result = LJungBox(vm_dataset)
result, raw_data = LJungBox(vm_dataset)

# Check if result is a DataFrame
self.assertIsInstance(result, pd.DataFrame)

# Check if raw_data is a RawData object
self.assertIsInstance(raw_data, vm.RawData)

# Check if the DataFrame has the expected columns
expected_columns = ["column", "stat", "pvalue"]
self.assertListEqual(list(result.columns), expected_columns)
8 changes: 5 additions & 3 deletions tests/unit_tests/data_validation/test_MissingValues.py
@@ -28,11 +28,13 @@ def setUp(self):
)

def test_missing_values_structure(self):
summary, passed = MissingValues(self.vm_dataset)
# Run the function
summary, passed, raw_data = MissingValues(self.vm_dataset)

# Check return types
self.assertIsInstance(summary, list)
self.assertIsInstance(passed, bool)
self.assertIsInstance(raw_data, vm.RawData)

# Check summary structure
for column_summary in summary:
Expand All @@ -42,7 +44,7 @@ def test_missing_values_structure(self):
self.assertIn("Pass/Fail", column_summary)

def test_missing_values_counts(self):
summary, passed = MissingValues(self.vm_dataset)
summary, passed, raw_data = MissingValues(self.vm_dataset)

# Get results for each column
no_missing = next(s for s in summary if s["Column"] == "no_missing")
@@ -69,7 +71,7 @@

def test_threshold_parameter(self):
# Test with higher threshold that allows some missing values
summary, passed = MissingValues(self.vm_dataset, min_threshold=25)
summary, passed, raw_data = MissingValues(self.vm_dataset, min_threshold=25)

# Get results
some_missing = next(s for s in summary if s["Column"] == "some_missing")