diff --git a/notebooks/how_to/assign_score_complete_tutorial.ipynb b/notebooks/how_to/assign_score_complete_tutorial.ipynb
new file mode 100644
index 000000000..cbb1d14bd
--- /dev/null
+++ b/notebooks/how_to/assign_score_complete_tutorial.ipynb
@@ -0,0 +1,723 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "# Intro to Assign Scores\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "The `assign_scores()` method is a powerful feature that allows you to compute and add unit metric scores as new columns in your dataset. This method takes a model and metric(s) as input, computes the specified metrics from the ValidMind unit_metrics library, and adds them as new columns. The computed metrics can be scalar values that apply to the entire dataset or per-row values, providing flexibility in how performance is measured and tracked.\n",
+ "\n",
+ "In this interactive notebook, we demonstrate how to use the `assign_scores()` method effectively. We'll walk through a complete example using a customer churn dataset, showing how to compute and assign both dataset-level metrics (like overall F1 score) and row-level metrics (like prediction probabilities). You'll learn how to work with single and multiple unit metrics, pass custom parameters, and handle different metric types - all while maintaining a clean, organized dataset structure. Currently, assign_scores() supports all metrics available in the validmind.unit_metrics module.\n",
+ "\n",
+ "**The Power of Integrated Scoring**\n",
+ "\n",
+ "Traditional model evaluation workflows often involve computing metrics separately from your core dataset, leading to fragmented analysis and potential data misalignment. The `assign_scores()` method addresses this challenge by:\n",
+ "\n",
+ "- **Seamless Integration**: Directly embedding computed metrics as dataset columns using a consistent naming convention\n",
+ "- **Enhanced Traceability**: Maintaining clear links between model predictions and performance metrics\n",
+ "- **Simplified Analysis**: Enabling straightforward comparison of metrics across different models and datasets\n",
+ "- **Standardized Workflow**: Providing a unified approach to metric computation and storage\n",
+ "\n",
+ "**Understanding assign_scores()**\n",
+ "\n",
+ "The `assign_scores()` method computes unit metrics for a given model-dataset combination and adds the results as new columns to your dataset. Each new column follows the naming convention: `{model.input_id}_{metric_name}`, ensuring clear identification of which model and metric combination generated each score.\n",
+ "\n",
+ "Key features:\n",
+ "\n",
+ "- **Flexible Input**: Accepts single metrics or lists of metrics\n",
+ "- **Parameter Support**: Allows passing additional parameters to underlying metric implementations\n",
+ "- **Multi-Model Support**: Can assign scores from multiple models to the same dataset\n",
+ "- **Type Agnostic**: Works with classification, regression, and other model types\n",
+ "\n",
+ "This approach streamlines your model evaluation workflow, making performance metrics an integral part of your dataset rather than external calculations.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "## Contents \n",
+ "- [About ValidMind](#toc1_) \n",
+ " - [Before you begin](#toc1_1_) \n",
+ " - [New to ValidMind?](#toc1_2_) \n",
+ "- [Install the ValidMind Library](#toc2_) \n",
+ "- [Initialize the ValidMind Library](#toc3_) \n",
+ " - [Get your code snippet](#toc3_1_) \n",
+ "- [Load the demo dataset](#toc4_) \n",
+ "- [Train models for testing](#toc5_) \n",
+ "- [Initialize ValidMind objects](#toc6_) \n",
+ "- [Assign predictions](#toc7_) \n",
+ "- [Using assign_scores()](#toc8_) \n",
+ " - [Basic Usage](#toc8_1_) \n",
+ " - [Single Metric Assignment](#toc8_2_) \n",
+ " - [Multiple Metrics Assignment](#toc8_3_) \n",
+ " - [Passing Parameters to Metrics](#toc8_4_) \n",
+ " - [Working with Different Metric Types](#toc8_5_) \n",
+ "- [Advanced assign_scores() Usage](#toc9_) \n",
+ " - [Multi-Model Scoring](#toc9_1_) \n",
+ " - [Individual Metrics](#toc9_2_) \n",
+ "- [Next steps](#toc12_) \n",
+ " - [Work with your model documentation](#toc12_1_) \n",
+ " - [Discover more learning resources](#toc12_2_) \n",
+ "- [Upgrade ValidMind](#toc13_) \n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "\n",
+ "## About ValidMind \n",
+ "\n",
+ "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.\n",
+ "\n",
+ "You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.\n",
+ "\n",
+ "\n",
+ "\n",
+ "### Before you begin \n",
+ "\n",
+ "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language. \n",
+ "\n",
+ "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "### New to ValidMind? \n",
+ "\n",
+ "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n",
+ "\n",
+ "
For access to all features available in this notebook, you'll need access to a ValidMind account.\n",
+ "
\n",
+ "
Register with ValidMind \n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## Install the ValidMind Library\n",
+ "\n",
+ "To install the library:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%pip install -q validmind\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## Initialize the ValidMind Library \n",
+ "\n",
+ "ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.\n",
+ "\n",
+ "\n",
+ "\n",
+ "### Get your code snippet\n",
+ "\n",
+ "1. In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).\n",
+ "\n",
+ "2. In the left sidebar, navigate to **Model Inventory** and click **+ Register Model**.\n",
+ "\n",
+ "3. Enter the model details and click **Continue**. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/register-models-in-inventory.html))\n",
+ "\n",
+ " For example, to register a model for use with this notebook, select:\n",
+ "\n",
+ " - Documentation template: `Binary classification`\n",
+ " - Use case: `Marketing/Sales - Analytics`\n",
+ "\n",
+ " You can fill in other options according to your preference.\n",
+ "\n",
+ "4. Go to **Getting Started** and click **Copy snippet to clipboard**.\n",
+ "\n",
+ "Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Load your model identifier credentials from an `.env` file\n",
+ "\n",
+ "%load_ext dotenv\n",
+ "%dotenv .env\n",
+ "\n",
+ "# Or replace with your code snippet\n",
+ "\n",
+ "import validmind as vm\n",
+ "\n",
+ "vm.init(\n",
+ " # api_host=\"...\",\n",
+ " # api_key=\"...\",\n",
+ " # api_secret=\"...\",\n",
+ " # model=\"...\",\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## Load the demo dataset \n",
+ "\n",
+ "In this example, we load a demo dataset to demonstrate the assign_scores functionality with customer churn prediction models.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from validmind.datasets.classification import customer_churn as demo_dataset\n",
+ "\n",
+ "print(\n",
+ " f\"Loaded demo dataset with: \\n\\n\\t• Target column: '{demo_dataset.target_column}' \\n\\t• Class labels: {demo_dataset.class_labels}\"\n",
+ ")\n",
+ "\n",
+ "raw_df = demo_dataset.load_data()\n",
+ "raw_df.head()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## Train models for testing \n",
+ "\n",
+ "We'll train two different customer churn models to demonstrate the assign_scores functionality with multiple models.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import xgboost as xgb\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "\n",
+ "# Preprocess the data\n",
+ "train_df, validation_df, test_df = demo_dataset.preprocess(raw_df)\n",
+ "\n",
+ "# Prepare training data\n",
+ "x_train = train_df.drop(demo_dataset.target_column, axis=1)\n",
+ "y_train = train_df[demo_dataset.target_column]\n",
+ "x_val = validation_df.drop(demo_dataset.target_column, axis=1)\n",
+ "y_val = validation_df[demo_dataset.target_column]\n",
+ "\n",
+ "# Train XGBoost model\n",
+ "xgb_model = xgb.XGBClassifier(early_stopping_rounds=10, random_state=42)\n",
+ "xgb_model.set_params(\n",
+ " eval_metric=[\"error\", \"logloss\", \"auc\"],\n",
+ ")\n",
+ "xgb_model.fit(\n",
+ " x_train,\n",
+ " y_train,\n",
+ " eval_set=[(x_val, y_val)],\n",
+ " verbose=False,\n",
+ ")\n",
+ "\n",
+ "# Train Random Forest model\n",
+ "rf_model = RandomForestClassifier(n_estimators=100, random_state=42)\n",
+ "rf_model.fit(x_train, y_train)\n",
+ "\n",
+ "print(\"Models trained successfully!\")\n",
+ "print(f\"XGBoost training accuracy: {xgb_model.score(x_train, y_train):.3f}\")\n",
+ "print(f\"Random Forest training accuracy: {rf_model.score(x_train, y_train):.3f}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## Initialize ValidMind objects \n",
+ "\n",
+ "We initialize ValidMind `dataset` and `model` objects. The `input_id` parameter is crucial for the assign_scores functionality as it determines the column naming convention for assigned scores.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Initialize datasets\n",
+ "vm_train_ds = vm.init_dataset(\n",
+ " input_id=\"train_dataset\",\n",
+ " dataset=train_df,\n",
+ " target_column=demo_dataset.target_column,\n",
+ ")\n",
+ "vm_test_ds = vm.init_dataset(\n",
+ " input_id=\"test_dataset\",\n",
+ " dataset=test_df,\n",
+ " target_column=demo_dataset.target_column,\n",
+ ")\n",
+ "\n",
+ "# Initialize models with descriptive input_ids\n",
+ "vm_xgb_model = vm.init_model(model=xgb_model, input_id=\"xgboost_model\")\n",
+ "vm_rf_model = vm.init_model(model=rf_model, input_id=\"random_forest_model\")\n",
+ "\n",
+ "print(\"ValidMind objects initialized successfully!\")\n",
+ "print(f\"XGBoost model ID: {vm_xgb_model.input_id}\")\n",
+ "print(f\"Random Forest model ID: {vm_rf_model.input_id}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## Assign predictions \n",
+ "\n",
+ "Before we can use assign_scores(), we need to assign predictions to our datasets. This step is essential as many unit metrics require both actual and predicted values.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Assign predictions for both models to both datasets\n",
+ "vm_train_ds.assign_predictions(model=vm_xgb_model)\n",
+ "vm_train_ds.assign_predictions(model=vm_rf_model)\n",
+ "\n",
+ "vm_test_ds.assign_predictions(model=vm_xgb_model)\n",
+ "vm_test_ds.assign_predictions(model=vm_rf_model)\n",
+ "\n",
+ "print(\"Predictions assigned successfully!\")\n",
+ "print(f\"Test dataset now has {len(vm_test_ds.df.columns)} columns\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## Using assign_scores()\n",
+ "\n",
+ "Now we'll explore the various ways to use the assign_scores() method to integrate performance metrics directly into your dataset.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "### Basic Usage\n",
+ "\n",
+ "The assign_scores() method has a simple interface:\n",
+ "\n",
+ "```python\n",
+ "dataset.assign_scores(model, metrics, **kwargs)\n",
+ "```\n",
+ "\n",
+ "- **model**: A ValidMind model object\n",
+ "- **metrics**: Single metric ID or list of metric IDs (can use short names or full IDs)\n",
+ "- **kwargs**: Additional parameters passed to the underlying metric implementations\n",
+ "\n",
+ "Let's first check what columns we currently have in our test dataset:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(\"Current columns in test dataset:\")\n",
+ "for i, col in enumerate(vm_test_ds.df.columns, 1):\n",
+ " print(f\"{i:2d}. {col}\")\n",
+ "\n",
+ "print(f\"\\nDataset shape: {vm_test_ds.df.shape}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "### Single Metric Assignment\n",
+ "\n",
+ "Let's start by assigning a single metric - the F1 score - for our XGBoost model on the test dataset.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Assign F1 score for XGBoost model\n",
+ "vm_test_ds.assign_scores(vm_xgb_model, \"F1\")\n",
+ "\n",
+ "print(\"After assigning F1 score:\")\n",
+ "print(f\"New column added: {vm_test_ds.df.columns}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "### Multiple Metrics Assignment\n",
+ "\n",
+ "We can assign multiple metrics at once by passing a list of metric names. This is more efficient than calling assign_scores() multiple times.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Assign multiple classification metrics for the Random Forest model\n",
+ "classification_metrics = [\"Precision\", \"Recall\", \"Accuracy\", \"ROC_AUC\"]\n",
+ "\n",
+ "vm_test_ds.assign_scores(vm_rf_model, classification_metrics)\n",
+ "\n",
+ "print(\"After assigning multiple metrics for Random Forest:\")\n",
+ "rf_columns = [col for col in vm_test_ds.df.columns if 'random_forest_model' in col]\n",
+ "print(f\"Random Forest columns: {rf_columns}\")\n",
+ "\n",
+ "# Display the metric values\n",
+ "for metric in classification_metrics:\n",
+ " col_name = f\"random_forest_model_{metric}\"\n",
+ " if col_name in vm_test_ds.df.columns:\n",
+ " value = vm_test_ds.df[col_name].iloc[0]\n",
+ " print(f\"{metric}: {value:.4f}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "### Passing Parameters to Metrics\n",
+ "\n",
+ "Many unit metrics accept additional parameters that are passed through to the underlying sklearn implementations. Let's demonstrate this with the ROC_AUC metric.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Assign ROC_AUC with different averaging strategies\n",
+ "vm_test_ds.assign_scores(vm_xgb_model, \"ROC_AUC\", average=\"macro\")\n",
+ "\n",
+ "# We can also assign with different parameters by calling assign_scores again\n",
+ "# Note: This will overwrite the previous column with the same name\n",
+ "print(\"ROC_AUC assigned with macro averaging\")\n",
+ "\n",
+ "# Let's also assign precision and recall with different averaging\n",
+ "vm_test_ds.assign_scores(vm_xgb_model, [\"Precision\", \"Recall\"], average=\"weighted\")\n",
+ "\n",
+ "print(\"Precision and Recall assigned with weighted averaging\")\n",
+ "\n",
+ "# Display current XGBoost metric columns\n",
+ "xgb_columns = [col for col in vm_test_ds.df.columns if 'xgboost_model' in col]\n",
+ "print(f\"\\nXGBoost model columns: {xgb_columns}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "### Multi-Model Scoring\n",
+ "\n",
+ "One of the powerful features of assign_scores() is the ability to assign scores from multiple models to the same dataset, enabling easy model comparison.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Let's assign a comprehensive set of metrics for both models\n",
+ "comprehensive_metrics = [\"F1\", \"Precision\", \"Recall\", \"Accuracy\", \"ROC_AUC\"]\n",
+ "\n",
+ "# Assign for XGBoost model\n",
+ "vm_test_ds.assign_scores(vm_xgb_model, comprehensive_metrics)\n",
+ "\n",
+ "# Assign for Random Forest model}\n",
+ "vm_test_ds.assign_scores(vm_rf_model, comprehensive_metrics)\n",
+ "\n",
+ "print(\"Comprehensive metrics assigned for both models!\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "### Individual Metrics\n",
+ "The next section demonstrates how to assign individual metrics that compute scores per row, rather than aggregate metrics.\n",
+ "We'll use two important metrics:\n",
+ " \n",
+ "- Brier Score: Measures how well calibrated the model's probability predictions are for each individual prediction\n",
+ "- Log Loss: Evaluates how well the predicted probabilities match the true labels on a per-prediction basis\n",
+ "\n",
+ "Both metrics provide more granular insights into model performance at the individual prediction level.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Let's add some individual metrics that compute per-row scores\n",
+ "print(\"Adding individual metrics...\")\n",
+ "\n",
+ "# Add Brier Score - measures accuracy of probabilistic predictions per row\n",
+ "vm_test_ds.assign_scores(vm_xgb_model, \"BrierScore\")\n",
+ "print(\"Added Brier Score - lower values indicate better calibrated probabilities\")\n",
+ "\n",
+ "# Add Log Loss - measures how well the predicted probabilities match true labels per row\n",
+ "vm_test_ds.assign_scores(vm_xgb_model, \"LogLoss\")\n",
+ "print(\"Added Log Loss - lower values indicate better probability estimates\")\n",
+ "\n",
+ "# Create a comparison summary showing first few rows of individual metrics\n",
+ "print(\"\\nFirst few rows of individual metrics:\")\n",
+ "individual_metrics = [col for col in vm_test_ds.df.columns if any(m in col for m in ['BrierScore', 'LogLoss'])]\n",
+ "print(vm_test_ds.df[individual_metrics].head())\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm_test_ds._df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## Next steps \n",
+ "\n",
+ "You can explore the assigned scores right in the notebook as demonstrated above. However, there's even more value in using the ValidMind Platform to work with your model documentation and monitoring.\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "### Work with your model documentation \n",
+ "\n",
+ "1. From the **Model Inventory** in the ValidMind Platform, go to the model you registered earlier. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/working-with-model-inventory.html))\n",
+ "\n",
+ "2. Click and expand the **Model Development** section.\n",
+ "\n",
+ "The scores you've assigned using `assign_scores()` become part of your model's documentation and can be used in ongoing monitoring workflows. You can view these metrics over time, set up alerts for performance drift, and compare models systematically. [Learn more ...](https://docs.validmind.ai/guide/model-documentation/working-with-model-documentation.html)\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "### Discover more learning resources \n",
+ "\n",
+ "We offer many interactive notebooks to help you work with model scoring and evaluation:\n",
+ "\n",
+ "- [Run unit metrics](https://docs.validmind.ai/developer/model-testing/testing-overview.html)\n",
+ "- [Assign predictions](https://docs.validmind.ai/developer/samples-jupyter-notebooks.html)\n",
+ "- [Model comparison workflows](https://docs.validmind.ai/developer/samples-jupyter-notebooks.html)\n",
+ "\n",
+ "Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "\n",
+ "\n",
+ "## Upgrade ValidMind\n",
+ "\n",
+ "After installing ValidMind, you'll want to periodically make sure you are on the latest version to access any new features and other enhancements.
\n",
+ "\n",
+ "Retrieve the information for the currently installed version of ValidMind:\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "%pip show validmind\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "If the version returned is lower than the version indicated in our [production open-source code](https://github.com/validmind/validmind-library/blob/prod/validmind/__version__.py), restart your notebook and run:\n",
+ "\n",
+ "```bash\n",
+ "%pip install --upgrade validmind\n",
+ "```\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "You may need to restart your kernel after running the upgrade package for changes to be applied.\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "ValidMind Library",
+ "language": "python",
+ "name": "validmind"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/pyproject.toml b/pyproject.toml
index 12ceea155..090a05cfe 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -10,7 +10,7 @@ description = "ValidMind Library"
license = "Commercial License"
name = "validmind"
readme = "README.pypi.md"
-version = "2.8.31"
+version = "2.9.0"
[tool.poetry.dependencies]
aiohttp = {extras = ["speedups"], version = "*"}
diff --git a/tests/test_dataset.py b/tests/test_dataset.py
index 41bc40fc8..c15aa07fe 100644
--- a/tests/test_dataset.py
+++ b/tests/test_dataset.py
@@ -516,6 +516,301 @@ def test_assign_predictions_with_invalid_predict_fn(self):
self.assertIn("FunctionModel requires a callable predict_fn", str(context.exception))
+ def test_assign_scores_single_metric(self):
+ """
+ Test assigning a single metric score to dataset
+ """
+ df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0, 1, 0]})
+ vm_dataset = DataFrameDataset(
+ raw_dataset=df, target_column="y", feature_columns=["x1", "x2"]
+ )
+
+ # Train a simple model
+ model = LogisticRegression()
+ model.fit(vm_dataset.x, vm_dataset.y.ravel())
+ vm_model = init_model(input_id="test_model", model=model, __log=False)
+
+ # Assign predictions first (required for unit metrics)
+ vm_dataset.assign_predictions(model=vm_model)
+
+ # Test assign_scores with single metric
+ vm_dataset.assign_scores(vm_model, "F1")
+
+ # Check that the metric column was added
+ expected_column = f"{vm_model.input_id}_F1"
+ self.assertTrue(expected_column in vm_dataset.df.columns)
+
+ # Verify the column has the same value for all rows (scalar metric)
+ metric_values = vm_dataset.df[expected_column]
+ self.assertEqual(metric_values.nunique(), 1, "All rows should have the same metric value")
+
+ # Verify the value is reasonable for F1 score (between 0 and 1)
+ f1_value = metric_values.iloc[0]
+ self.assertTrue(0 <= f1_value <= 1, f"F1 score should be between 0 and 1, got {f1_value}")
+
+ def test_assign_scores_multiple_metrics(self):
+ """
+ Test assigning multiple metric scores to dataset
+ """
+ df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0, 1, 0]})
+ vm_dataset = DataFrameDataset(
+ raw_dataset=df, target_column="y", feature_columns=["x1", "x2"]
+ )
+
+ # Train a simple model
+ model = LogisticRegression()
+ model.fit(vm_dataset.x, vm_dataset.y.ravel())
+ vm_model = init_model(input_id="test_model", model=model, __log=False)
+
+ # Assign predictions first
+ vm_dataset.assign_predictions(model=vm_model)
+
+ # Test assign_scores with multiple metrics
+ metrics = ["F1", "Precision", "Recall"]
+ vm_dataset.assign_scores(vm_model, metrics)
+
+ # Check that all metric columns were added
+ for metric in metrics:
+ expected_column = f"{vm_model.input_id}_{metric}"
+ self.assertTrue(expected_column in vm_dataset.df.columns)
+
+ # Verify each column has the same value for all rows
+ metric_values = vm_dataset.df[expected_column]
+ self.assertEqual(metric_values.nunique(), 1, f"All rows should have the same {metric} value")
+
+ # Verify the value is reasonable (between 0 and 1 for these metrics)
+ metric_value = metric_values.iloc[0]
+ self.assertTrue(0 <= metric_value <= 1, f"{metric} should be between 0 and 1, got {metric_value}")
+
+ def test_assign_scores_with_parameters(self):
+ """
+ Test assigning metric scores with custom parameters
+ """
+ df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0, 1, 0]})
+ vm_dataset = DataFrameDataset(
+ raw_dataset=df, target_column="y", feature_columns=["x1", "x2"]
+ )
+
+ # Train a simple model
+ model = LogisticRegression()
+ model.fit(vm_dataset.x, vm_dataset.y.ravel())
+ vm_model = init_model(input_id="test_model", model=model, __log=False)
+
+ # Assign predictions first
+ vm_dataset.assign_predictions(model=vm_model)
+
+ # Test assign_scores with parameters
+ vm_dataset.assign_scores(vm_model, "ROC_AUC", **{"average": "weighted"})
+
+ # Check that the metric column was added
+ expected_column = f"{vm_model.input_id}_ROC_AUC"
+ self.assertTrue(expected_column in vm_dataset.df.columns)
+
+ # Verify the value is reasonable for ROC AUC (between 0 and 1)
+ roc_values = vm_dataset.df[expected_column]
+ roc_value = roc_values.iloc[0]
+ self.assertTrue(0 <= roc_value <= 1, f"ROC AUC should be between 0 and 1, got {roc_value}")
+
+ def test_assign_scores_full_metric_id(self):
+ """
+ Test assigning scores using full metric IDs
+ """
+ df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0, 1, 0]})
+ vm_dataset = DataFrameDataset(
+ raw_dataset=df, target_column="y", feature_columns=["x1", "x2"]
+ )
+
+ # Train a simple model
+ model = LogisticRegression()
+ model.fit(vm_dataset.x, vm_dataset.y.ravel())
+ vm_model = init_model(input_id="test_model", model=model, __log=False)
+
+ # Assign predictions first
+ vm_dataset.assign_predictions(model=vm_model)
+
+ # Test assign_scores with full metric ID
+ full_metric_id = "validmind.unit_metrics.classification.Accuracy"
+ vm_dataset.assign_scores(vm_model, full_metric_id)
+
+ # Check that the metric column was added with correct name
+ expected_column = f"{vm_model.input_id}_Accuracy"
+ self.assertTrue(expected_column in vm_dataset.df.columns)
+
+ # Verify the value is reasonable for accuracy (between 0 and 1)
+ accuracy_values = vm_dataset.df[expected_column]
+ accuracy_value = accuracy_values.iloc[0]
+ self.assertTrue(0 <= accuracy_value <= 1, f"Accuracy should be between 0 and 1, got {accuracy_value}")
+
+ def test_assign_scores_regression_model(self):
+ """
+ Test assigning metric scores for regression model
+ """
+ df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0.1, 1.2, 2.3]})
+ vm_dataset = DataFrameDataset(
+ raw_dataset=df, target_column="y", feature_columns=["x1", "x2"]
+ )
+
+ # Train a regression model
+ model = LinearRegression()
+ model.fit(vm_dataset.x, vm_dataset.y.ravel())
+ vm_model = init_model(input_id="reg_model", model=model, __log=False)
+
+ # Assign predictions first
+ vm_dataset.assign_predictions(model=vm_model)
+
+ # Test assign_scores with regression metrics
+ vm_dataset.assign_scores(vm_model, ["MeanSquaredError", "RSquaredScore"])
+
+ # Check that both metric columns were added
+ expected_columns = ["reg_model_MeanSquaredError", "reg_model_RSquaredScore"]
+ for column in expected_columns:
+ self.assertTrue(column in vm_dataset.df.columns)
+
+ # Verify R-squared is reasonable (can be negative, but typically between -1 and 1 for reasonable models)
+ r2_values = vm_dataset.df["reg_model_RSquaredScore"]
+ r2_value = r2_values.iloc[0]
+ self.assertTrue(-2 <= r2_value <= 1, f"R-squared should be reasonable, got {r2_value}")
+
+ # Verify MSE is non-negative
+ mse_values = vm_dataset.df["reg_model_MeanSquaredError"]
+ mse_value = mse_values.iloc[0]
+ self.assertTrue(mse_value >= 0, f"MSE should be non-negative, got {mse_value}")
+
+ def test_assign_scores_no_model_input_id(self):
+ """
+ Test that assign_scores raises error when model has no input_id
+ """
+ df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0, 1, 0]})
+ vm_dataset = DataFrameDataset(
+ raw_dataset=df, target_column="y", feature_columns=["x1", "x2"]
+ )
+
+ # Create model without input_id
+ model = LogisticRegression()
+ model.fit(vm_dataset.x, vm_dataset.y.ravel())
+ vm_model = init_model(model=model, __log=False) # No input_id provided
+
+ # Clear the input_id to test the error case
+ vm_model.input_id = None
+
+ # Should raise ValueError
+ with self.assertRaises(ValueError) as context:
+ vm_dataset.assign_scores(vm_model, "F1")
+
+ self.assertIn("Model input_id must be set", str(context.exception))
+
+ def test_assign_scores_invalid_metric(self):
+ """
+ Test that assign_scores raises error for invalid metric
+ """
+ df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0, 1, 0]})
+ vm_dataset = DataFrameDataset(
+ raw_dataset=df, target_column="y", feature_columns=["x1", "x2"]
+ )
+
+ # Train a simple model
+ model = LogisticRegression()
+ model.fit(vm_dataset.x, vm_dataset.y.ravel())
+ vm_model = init_model(input_id="test_model", model=model, __log=False)
+
+ # Assign predictions first
+ vm_dataset.assign_predictions(model=vm_model)
+
+ # Should raise ValueError for invalid metric
+ with self.assertRaises(ValueError) as context:
+ vm_dataset.assign_scores(vm_model, "InvalidMetricName")
+
+ self.assertIn("Metric 'InvalidMetricName' not found", str(context.exception))
+
+ def test_assign_scores_no_predictions(self):
+ """
+ Test that assign_scores raises error when predictions haven't been assigned yet
+ """
+ df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0, 1, 0]})
+ vm_dataset = DataFrameDataset(
+ raw_dataset=df, target_column="y", feature_columns=["x1", "x2"]
+ )
+
+ # Train a simple model
+ model = LogisticRegression()
+ model.fit(vm_dataset.x, vm_dataset.y.ravel())
+ vm_model = init_model(input_id="test_model", model=model, __log=False)
+
+ # Don't assign predictions - test that assign_scores raises error
+ # (unit metrics require predictions to be available)
+ with self.assertRaises(ValueError) as context:
+ vm_dataset.assign_scores(vm_model, "F1")
+
+ self.assertIn("No prediction column found", str(context.exception))
+
+ def test_assign_scores_column_naming_convention(self):
+ """
+ Test that assign_scores follows the correct column naming convention
+ """
+ df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0, 1, 0]})
+ vm_dataset = DataFrameDataset(
+ raw_dataset=df, target_column="y", feature_columns=["x1", "x2"]
+ )
+
+ # Train a simple model
+ model = LogisticRegression()
+ model.fit(vm_dataset.x, vm_dataset.y.ravel())
+ vm_model = init_model(input_id="my_special_model", model=model, __log=False)
+
+ # Assign predictions first
+ vm_dataset.assign_predictions(model=vm_model)
+
+ # Test multiple metrics to verify naming convention
+ metrics = ["F1", "Precision", "Recall"]
+ vm_dataset.assign_scores(vm_model, metrics)
+
+ # Verify all columns follow the naming convention: {model.input_id}_{metric_name}
+ for metric in metrics:
+ expected_column = f"my_special_model_{metric}"
+ self.assertTrue(expected_column in vm_dataset.df.columns,
+ f"Expected column '{expected_column}' not found")
+
+ def test_assign_scores_multiple_models(self):
+ """
+ Test assigning scores from multiple models to same dataset
+ """
+ df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6], "y": [0, 1, 0]})
+ vm_dataset = DataFrameDataset(
+ raw_dataset=df, target_column="y", feature_columns=["x1", "x2"]
+ )
+
+ # Train two different models
+ lr_model = LogisticRegression()
+ lr_model.fit(vm_dataset.x, vm_dataset.y.ravel())
+ vm_lr_model = init_model(input_id="lr_model", model=lr_model, __log=False)
+
+ rf_model = RandomForestClassifier(n_estimators=5, random_state=42)
+ rf_model.fit(vm_dataset.x, vm_dataset.y.ravel())
+ vm_rf_model = init_model(input_id="rf_model", model=rf_model, __log=False)
+
+ # Assign predictions for both models
+ vm_dataset.assign_predictions(model=vm_lr_model)
+ vm_dataset.assign_predictions(model=vm_rf_model)
+
+ # Assign scores for both models
+ vm_dataset.assign_scores(vm_lr_model, "F1")
+ vm_dataset.assign_scores(vm_rf_model, "F1")
+
+ # Check that both metric columns exist with correct names
+ lr_column = "lr_model_F1"
+ rf_column = "rf_model_F1"
+
+ self.assertTrue(lr_column in vm_dataset.df.columns)
+ self.assertTrue(rf_column in vm_dataset.df.columns)
+
+ # Verify that the values might be different (different models)
+ lr_f1 = vm_dataset.df[lr_column].iloc[0]
+ rf_f1 = vm_dataset.df[rf_column].iloc[0]
+
+ # Both should be valid F1 scores
+ self.assertTrue(0 <= lr_f1 <= 1)
+ self.assertTrue(0 <= rf_f1 <= 1)
+
if __name__ == "__main__":
unittest.main()
diff --git a/validmind/__version__.py b/validmind/__version__.py
index 3a5995d32..43ce13db0 100644
--- a/validmind/__version__.py
+++ b/validmind/__version__.py
@@ -1 +1 @@
-__version__ = "2.8.31"
+__version__ = "2.9.0"
diff --git a/validmind/tests/output.py b/validmind/tests/output.py
index 52ee23d1b..760335acb 100644
--- a/validmind/tests/output.py
+++ b/validmind/tests/output.py
@@ -45,7 +45,13 @@ def process(self, item: Any, result: TestResult) -> None:
class MetricOutputHandler(OutputHandler):
def can_handle(self, item: Any) -> bool:
- return isinstance(item, (int, float))
+ # Accept individual numbers
+ if isinstance(item, (int, float)):
+ return True
+ # Accept lists/arrays of numbers for per-row metrics
+ if isinstance(item, (list, tuple, np.ndarray)):
+ return all(isinstance(x, (int, float, np.number)) for x in item)
+ return False
def process(self, item: Any, result: TestResult) -> None:
if result.metric is not None:
diff --git a/validmind/unit_metrics/classification/individual/AbsoluteError.py b/validmind/unit_metrics/classification/individual/AbsoluteError.py
new file mode 100644
index 000000000..403e10657
--- /dev/null
+++ b/validmind/unit_metrics/classification/individual/AbsoluteError.py
@@ -0,0 +1,42 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List
+
+import numpy as np
+
+from validmind import tags, tasks
+from validmind.vm_models import VMDataset, VMModel
+
+
+@tasks("classification")
+@tags("classification")
+def AbsoluteError(model: VMModel, dataset: VMDataset, **kwargs) -> List[float]:
+ """Calculates the absolute error per row for a classification model.
+
+ For classification tasks, this computes the absolute difference between
+ the true class labels and predicted class labels for each individual row.
+ For errors measured against predicted probabilities rather than class labels,
+ see the ProbabilityError metric.
+
+ Args:
+ model: The classification model to evaluate
+ dataset: The dataset containing true labels and predictions
+ **kwargs: Additional parameters (unused for compatibility)
+
+ Returns:
+ List[float]: Per-row absolute errors as a list of float values
+ """
+ y_true = dataset.y
+ y_pred = dataset.y_pred(model)
+
+ # Convert to numpy arrays and ensure same data type
+ y_true = np.asarray(y_true)
+ y_pred = np.asarray(y_pred)
+
+ # For classification, compute absolute difference between true and predicted labels
+ absolute_errors = np.abs(y_true - y_pred)
+
+ # Return as a list of floats
+ return absolute_errors.astype(float).tolist()
diff --git a/validmind/unit_metrics/classification/individual/BrierScore.py b/validmind/unit_metrics/classification/individual/BrierScore.py
new file mode 100644
index 000000000..279cfa500
--- /dev/null
+++ b/validmind/unit_metrics/classification/individual/BrierScore.py
@@ -0,0 +1,56 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List
+
+import numpy as np
+
+from validmind import tags, tasks
+from validmind.vm_models import VMDataset, VMModel
+
+
+@tasks("classification")
+@tags("classification")
+def BrierScore(model: VMModel, dataset: VMDataset, **kwargs) -> List[float]:
+ """Calculates the Brier score per row for a classification model.
+
+ The Brier score is a proper score function that measures the accuracy of
+ probabilistic predictions. For each row, it is the squared difference
+ between the predicted probability and the actual binary outcome.
+ Lower scores indicate better calibration.
+
+ Args:
+ model: The classification model to evaluate
+ dataset: The dataset containing true labels and predicted probabilities
+ **kwargs: Additional parameters (unused for compatibility)
+
+ Returns:
+ List[float]: Per-row Brier scores as a list of float values
+
+ Raises:
+ ValueError: If probability column is not found for the model
+ """
+ y_true = dataset.y
+
+ # Try to get probabilities
+ try:
+ y_prob = dataset.y_prob(model)
+ # For binary classification, use the positive class probability
+ if y_prob.ndim > 1 and y_prob.shape[1] > 1:
+ y_prob = y_prob[:, 1] # Use probability of positive class
+ except ValueError:
+ # Fall back to predictions if probabilities not available
+ # Convert predictions to "probabilities" (1.0 for predicted class, 0.0 for other)
+ y_pred = dataset.y_pred(model)
+ y_prob = y_pred.astype(float)
+
+ # Convert to numpy arrays and ensure same data type
+ y_true = np.asarray(y_true, dtype=float)
+ y_prob = np.asarray(y_prob, dtype=float)
+
+ # Calculate Brier score per row: (predicted_probability - actual_outcome)²
+ brier_scores = (y_prob - y_true) ** 2
+
+ # Return as a list of floats
+ return brier_scores.tolist()
diff --git a/validmind/unit_metrics/classification/individual/CalibrationError.py b/validmind/unit_metrics/classification/individual/CalibrationError.py
new file mode 100644
index 000000000..ba05c83fc
--- /dev/null
+++ b/validmind/unit_metrics/classification/individual/CalibrationError.py
@@ -0,0 +1,77 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List
+
+import numpy as np
+
+from validmind import tags, tasks
+from validmind.vm_models import VMDataset, VMModel
+
+
+@tasks("classification")
+@tags("classification")
+def CalibrationError(
+ model: VMModel, dataset: VMDataset, n_bins: int = 10, **kwargs
+) -> List[float]:
+ """Calculates the calibration error per row for a classification model.
+
+ Calibration error measures how well the predicted probabilities reflect the
+ actual likelihood of the positive class. Each prediction is assigned the
+ absolute difference between the average predicted probability and the empirical
+ frequency of the positive class within its probability bin.
+
+ Args:
+ model: The classification model to evaluate
+ dataset: The dataset containing true labels and predicted probabilities
+ n_bins: Number of bins for probability calibration, defaults to 10
+ **kwargs: Additional parameters (unused for compatibility)
+
+ Returns:
+ List[float]: Per-row calibration errors as a list of float values
+
+ Raises:
+ ValueError: If probability column is not found for the model
+ """
+ y_true = dataset.y
+
+ # Try to get probabilities
+ try:
+ y_prob = dataset.y_prob(model)
+ # For binary classification, use the positive class probability
+ if y_prob.ndim > 1 and y_prob.shape[1] > 1:
+ y_prob = y_prob[:, 1] # Use probability of positive class
+ except ValueError:
+ # If no probabilities available, return zeros (perfect calibration for hard predictions)
+ return [0.0] * len(y_true)
+
+ # Convert to numpy arrays
+ y_true = np.asarray(y_true, dtype=float)
+ y_prob = np.asarray(y_prob, dtype=float)
+
+ # Create probability bins
+ bin_boundaries = np.linspace(0, 1, n_bins + 1)
+ bin_lowers = bin_boundaries[:-1]
+ bin_uppers = bin_boundaries[1:]
+
+ # Calculate calibration error for each sample
+ calibration_errors = np.zeros_like(y_prob)
+
+ for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
+ # Find samples in this bin
+ in_bin = (y_prob > bin_lower) & (y_prob <= bin_upper)
+ if not np.any(in_bin):
+ continue
+
+ # Calculate empirical frequency for this bin
+ empirical_freq = np.mean(y_true[in_bin])
+
+ # Calculate average predicted probability for this bin
+ avg_predicted_prob = np.mean(y_prob[in_bin])
+
+ # Assign calibration error to all samples in this bin
+ calibration_errors[in_bin] = abs(avg_predicted_prob - empirical_freq)
+
+ # Return as a list of floats
+ return calibration_errors.tolist()
diff --git a/validmind/unit_metrics/classification/individual/ClassBalance.py b/validmind/unit_metrics/classification/individual/ClassBalance.py
new file mode 100644
index 000000000..1c38da453
--- /dev/null
+++ b/validmind/unit_metrics/classification/individual/ClassBalance.py
@@ -0,0 +1,65 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List
+
+import numpy as np
+
+from validmind import tags, tasks
+from validmind.vm_models import VMDataset, VMModel
+
+
+@tasks("classification")
+@tags("classification")
+def ClassBalance(model: VMModel, dataset: VMDataset, **kwargs) -> List[float]:
+ """Calculates the class balance score per row for a classification model.
+
+ For each prediction, this returns how balanced the predicted class is in the
+ dataset's label distribution (used as a proxy for the training distribution).
+ Lower scores indicate predictions on rare classes,
+ higher scores indicate predictions on common classes. This helps understand
+ if model errors are more likely on imbalanced classes.
+
+ Args:
+ model: The classification model to evaluate
+ dataset: The dataset containing true labels and predictions
+ **kwargs: Additional parameters (unused for compatibility)
+
+ Returns:
+ List[float]: Per-row class balance scores as a list of float values
+
+ Note:
+ Scores range from 0 to 0.5, where 0.5 indicates perfectly balanced classes
+ and lower values indicate more imbalanced classes.
+ """
+ y_true = dataset.y
+ y_pred = dataset.y_pred(model)
+
+ # Convert to numpy arrays
+ y_true = np.asarray(y_true)
+ y_pred = np.asarray(y_pred)
+
+ # Calculate class frequencies in the true labels (proxy for training distribution)
+ unique_classes, class_counts = np.unique(y_true, return_counts=True)
+ class_frequencies = class_counts / len(y_true)
+
+ # Create a mapping from class to frequency
+ class_to_freq = dict(zip(unique_classes, class_frequencies))
+
+ # Calculate balance score for each prediction
+ balance_scores = []
+
+ for pred in y_pred:
+ if pred in class_to_freq:
+ freq = class_to_freq[pred]
+ # Balance score: how close to 0.5 (perfectly balanced) the frequency is
+ # Score = 0.5 - |freq - 0.5| = min(freq, 1-freq)
+ balance_score = min(freq, 1 - freq)
+ else:
+ # Predicted class not seen in true labels (very rare)
+ balance_score = 0.0
+
+ balance_scores.append(balance_score)
+
+ # Return as a list of floats
+ return balance_scores
diff --git a/validmind/unit_metrics/classification/individual/Confidence.py b/validmind/unit_metrics/classification/individual/Confidence.py
new file mode 100644
index 000000000..a60394525
--- /dev/null
+++ b/validmind/unit_metrics/classification/individual/Confidence.py
@@ -0,0 +1,52 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List
+
+import numpy as np
+
+from validmind import tags, tasks
+from validmind.vm_models import VMDataset, VMModel
+
+
+@tasks("classification")
+@tags("classification")
+def Confidence(model: VMModel, dataset: VMDataset, **kwargs) -> List[float]:
+ """Calculates the prediction confidence per row for a classification model.
+
+ For multi-class models, confidence is the maximum predicted probability across
+ classes; for binary models, it is based on the distance of the predicted
+ probability from the decision boundary (0.5), rescaled to the range [0.5, 1].
+ Higher values indicate more confident predictions.
+
+ Args:
+ model: The classification model to evaluate
+ dataset: The dataset containing true labels and predicted probabilities
+ **kwargs: Additional parameters (unused for compatibility)
+
+ Returns:
+ List[float]: Per-row confidence scores as a list of float values
+
+ Raises:
+ ValueError: If probability column is not found for the model
+ """
+ # Try to get probabilities, fall back to predictions if not available
+ try:
+ y_prob = dataset.y_prob(model)
+ # For binary classification, use max probability approach
+ if y_prob.ndim > 1 and y_prob.shape[1] > 1:
+ # Multi-class: confidence is the maximum probability
+ confidence = np.max(y_prob, axis=1)
+ else:
+ # Binary classification: confidence based on distance from 0.5
+ y_prob = np.asarray(y_prob, dtype=float)
+ confidence = np.abs(y_prob - 0.5) + 0.5
+ except ValueError:
+ # Fall back to binary correctness if probabilities not available
+ y_true = dataset.y
+ y_pred = dataset.y_pred(model)
+ # If no probabilities, confidence is 1.0 for correct, 0.0 for incorrect
+ confidence = (y_true == y_pred).astype(float)
+
+ # Return as a list of floats
+ return confidence.tolist()
diff --git a/validmind/unit_metrics/classification/individual/Correctness.py b/validmind/unit_metrics/classification/individual/Correctness.py
new file mode 100644
index 000000000..81d45368c
--- /dev/null
+++ b/validmind/unit_metrics/classification/individual/Correctness.py
@@ -0,0 +1,41 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List
+
+import numpy as np
+
+from validmind import tags, tasks
+from validmind.vm_models import VMDataset, VMModel
+
+
+@tasks("classification")
+@tags("classification")
+def Correctness(model: VMModel, dataset: VMDataset, **kwargs) -> List[int]:
+ """Calculates the correctness per row for a classification model.
+
+ For classification tasks, this returns 1 for correctly classified rows
+ and 0 for incorrectly classified rows. This provides a binary indicator
+ of model performance for each individual prediction.
+
+ Args:
+ model: The classification model to evaluate
+ dataset: The dataset containing true labels and predictions
+ **kwargs: Additional parameters (unused for compatibility)
+
+ Returns:
+ List[int]: Per-row correctness as a list of 1s and 0s
+ """
+ y_true = dataset.y
+ y_pred = dataset.y_pred(model)
+
+ # Convert to numpy arrays
+ y_true = np.asarray(y_true)
+ y_pred = np.asarray(y_pred)
+
+ # For classification, check if predictions match true labels
+ correctness = (y_true == y_pred).astype(int)
+
+ # Return as a list of integers
+ return correctness.tolist()
diff --git a/validmind/unit_metrics/classification/individual/LogLoss.py b/validmind/unit_metrics/classification/individual/LogLoss.py
new file mode 100644
index 000000000..9a9b61a9b
--- /dev/null
+++ b/validmind/unit_metrics/classification/individual/LogLoss.py
@@ -0,0 +1,61 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List
+
+import numpy as np
+
+from validmind import tags, tasks
+from validmind.vm_models import VMDataset, VMModel
+
+
+@tasks("classification")
+@tags("classification")
+def LogLoss(
+ model: VMModel, dataset: VMDataset, eps: float = 1e-15, **kwargs
+) -> List[float]:
+ """Calculates the logarithmic loss per row for a classification model.
+
+ Log loss measures the performance of a classification model where the prediction
+ is a probability value between 0 and 1. The log loss increases as the predicted
+ probability diverges from the actual label.
+
+ Args:
+ model: The classification model to evaluate
+ dataset: The dataset containing true labels and predicted probabilities
+ eps: Small value to avoid log(0), defaults to 1e-15
+ **kwargs: Additional parameters (unused for compatibility)
+
+ Returns:
+ List[float]: Per-row log loss values as a list of float values
+
+ Raises:
+ ValueError: If probability column is not found for the model
+ """
+ y_true = dataset.y
+
+ # Try to get probabilities
+ try:
+ y_prob = dataset.y_prob(model)
+ # For binary classification, use the positive class probability
+ if y_prob.ndim > 1 and y_prob.shape[1] > 1:
+ y_prob = y_prob[:, 1] # Use probability of positive class
+ except ValueError:
+ # Fall back to predictions if probabilities not available
+ # Convert predictions to "probabilities" (0.99 for correct class, 0.01 for wrong)
+ y_pred = dataset.y_pred(model)
+ y_prob = np.where(y_true == y_pred, 0.99, 0.01)
+
+ # Convert to numpy arrays and ensure same data type
+ y_true = np.asarray(y_true, dtype=float)
+ y_prob = np.asarray(y_prob, dtype=float)
+
+ # Clip probabilities to avoid log(0) and log(1)
+ y_prob = np.clip(y_prob, eps, 1 - eps)
+
+ # Calculate log loss per row: -[y*log(p) + (1-y)*log(1-p)]
+ log_loss_per_row = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
+
+ # Return as a list of floats
+ return log_loss_per_row.tolist()
diff --git a/validmind/unit_metrics/classification/individual/OutlierScore.py b/validmind/unit_metrics/classification/individual/OutlierScore.py
new file mode 100644
index 000000000..1e54fbc38
--- /dev/null
+++ b/validmind/unit_metrics/classification/individual/OutlierScore.py
@@ -0,0 +1,86 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List
+
+import numpy as np
+from sklearn.ensemble import IsolationForest
+from sklearn.preprocessing import StandardScaler
+
+from validmind import tags, tasks
+from validmind.vm_models import VMDataset, VMModel
+
+
+@tasks("classification")
+@tags("classification")
+def OutlierScore(
+ model: VMModel, dataset: VMDataset, contamination: float = 0.1, **kwargs
+) -> List[float]:
+ """Calculates the outlier score per row for a classification model.
+
+ Uses Isolation Forest to identify samples that deviate significantly from
+ the typical patterns in the feature space. Higher scores indicate more
+ anomalous/outlier-like samples. This can help identify out-of-distribution
+ samples or data points that might be harder to predict accurately.
+
+ Args:
+ model: The classification model to evaluate (unused but kept for consistency)
+ dataset: The dataset containing feature data
+ contamination: Expected proportion of outliers, defaults to 0.1
+ **kwargs: Additional parameters (unused for compatibility)
+
+ Returns:
+ List[float]: Per-row outlier scores as a list of float values
+
+ Note:
+ Scores are normalized to [0, 1] where higher values indicate more outlier-like samples
+ """
+ # Get feature data
+ X = dataset.x_df()
+
+ # Handle case where we have no features or only categorical features
+ if X.empty or X.shape[1] == 0:
+ # Return zero outlier scores if no features available
+ return [0.0] * len(dataset.y)
+
+ # Select only numeric features for outlier detection
+ numeric_features = dataset.feature_columns_numeric
+ if not numeric_features:
+ # If no numeric features, return zero outlier scores
+ return [0.0] * len(dataset.y)
+
+ X_numeric = X[numeric_features]
+
+ # Handle missing values by filling with median
+ X_filled = X_numeric.fillna(X_numeric.median())
+
+ # Standardize features for better outlier detection
+ scaler = StandardScaler()
+ X_scaled = scaler.fit_transform(X_filled)
+
+ # Fit Isolation Forest
+ isolation_forest = IsolationForest(
+ contamination=contamination, random_state=42, n_estimators=100
+ )
+
+ # Fit the model on the data
+ isolation_forest.fit(X_scaled)
+
+ # Get anomaly scores (negative values for outliers)
+ anomaly_scores = isolation_forest.decision_function(X_scaled)
+
+ # Convert to outlier scores (0 to 1, where 1 is most outlier-like)
+ # Normalize using min-max scaling
+ min_score = np.min(anomaly_scores)
+ max_score = np.max(anomaly_scores)
+
+ if max_score == min_score:
+ # All samples have same score, no outliers detected
+ outlier_scores = np.zeros_like(anomaly_scores)
+ else:
+ # Invert and normalize: higher values = more outlier-like
+ outlier_scores = (max_score - anomaly_scores) / (max_score - min_score)
+
+ # Return as a list of floats
+ return outlier_scores.tolist()
diff --git a/validmind/unit_metrics/classification/individual/ProbabilityError.py b/validmind/unit_metrics/classification/individual/ProbabilityError.py
new file mode 100644
index 000000000..c96929820
--- /dev/null
+++ b/validmind/unit_metrics/classification/individual/ProbabilityError.py
@@ -0,0 +1,54 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List
+
+import numpy as np
+
+from validmind import tags, tasks
+from validmind.vm_models import VMDataset, VMModel
+
+
+@tasks("classification")
+@tags("classification")
+def ProbabilityError(model: VMModel, dataset: VMDataset, **kwargs) -> List[float]:
+ """Calculates the probability error per row for a classification model.
+
+ For binary classification tasks, this computes the absolute difference between
+ the true class labels (0 or 1) and the predicted probabilities for each row.
+ This provides insight into how confident the model's predictions are and
+ how far off they are from the actual labels.
+
+ Args:
+ model: The classification model to evaluate
+ dataset: The dataset containing true labels and predicted probabilities
+ **kwargs: Additional parameters (unused for compatibility)
+
+ Returns:
+ List[float]: Per-row probability errors as a list of float values
+
+ Raises:
+ ValueError: If probability column is not found for the model
+ """
+ y_true = dataset.y
+
+ # Try to get probabilities, fall back to predictions if not available
+ try:
+ y_prob = dataset.y_prob(model)
+ # For binary classification, use the positive class probability
+ if y_prob.ndim > 1 and y_prob.shape[1] > 1:
+ y_prob = y_prob[:, 1] # Use probability of positive class
+ except ValueError:
+ # Fall back to predictions if probabilities not available
+ y_prob = dataset.y_pred(model)
+
+    # Convert to flat numpy arrays with a common dtype so the error is computed per row
+    y_true = np.asarray(y_true, dtype=float).ravel()
+    y_prob = np.asarray(y_prob, dtype=float).ravel()
+
+ # Compute absolute difference between true labels and predicted probabilities
+ probability_errors = np.abs(y_true - y_prob)
+
+ # Return as a list of floats
+ return probability_errors.tolist()
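As a quick sanity check of the per-row definition above, a hedged sketch with hand-made arrays and no ValidMind objects: rows whose predicted probability sits far from the true label receive a large error.

    import numpy as np

    y_true = np.array([1, 0, 1, 0], dtype=float)
    y_prob = np.array([0.9, 0.2, 0.4, 0.05], dtype=float)

    probability_errors = np.abs(y_true - y_prob)
    print(np.round(probability_errors, 2).tolist())  # [0.1, 0.2, 0.6, 0.05]

Once the module is importable under validmind.unit_metrics.classification.individual, the same values can be attached as a dataset column via dataset.assign_scores(model, "ProbabilityError"), which resolves the short name to this metric.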
diff --git a/validmind/unit_metrics/classification/individual/Uncertainty.py b/validmind/unit_metrics/classification/individual/Uncertainty.py
new file mode 100644
index 000000000..0d28fbac8
--- /dev/null
+++ b/validmind/unit_metrics/classification/individual/Uncertainty.py
@@ -0,0 +1,60 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List
+
+import numpy as np
+
+from validmind import tags, tasks
+from validmind.vm_models import VMDataset, VMModel
+
+
+@tasks("classification")
+@tags("classification")
+def Uncertainty(model: VMModel, dataset: VMDataset, **kwargs) -> List[float]:
+ """Calculates the prediction uncertainty per row for a classification model.
+
+ Uncertainty is measured using the entropy of the predicted probability distribution.
+ Higher entropy indicates higher uncertainty in the prediction. For binary
+ classification, maximum uncertainty occurs at probability 0.5.
+
+ Args:
+ model: The classification model to evaluate
+ dataset: The dataset containing true labels and predicted probabilities
+        **kwargs: Additional parameters (accepted for interface compatibility; currently unused)
+
+ Returns:
+ List[float]: Per-row uncertainty scores as a list of float values
+
+    Note:
+        Returns zero uncertainty for every row when probabilities are unavailable for the model.
+ """
+ # Try to get probabilities
+ try:
+ y_prob = dataset.y_prob(model)
+
+ if y_prob.ndim > 1 and y_prob.shape[1] > 1:
+ # Multi-class: calculate entropy across all classes
+ # Clip to avoid log(0)
+ y_prob_clipped = np.clip(y_prob, 1e-15, 1 - 1e-15)
+ # Entropy: -sum(p * log(p))
+ uncertainty = -np.sum(y_prob_clipped * np.log(y_prob_clipped), axis=1)
+ else:
+ # Binary classification: calculate binary entropy
+            y_prob = np.asarray(y_prob, dtype=float).ravel()  # flatten a (n, 1) column to (n,)
+ # Clip to avoid log(0)
+ y_prob_clipped = np.clip(y_prob, 1e-15, 1 - 1e-15)
+ # Binary entropy: -[p*log(p) + (1-p)*log(1-p)]
+ uncertainty = -(
+ y_prob_clipped * np.log(y_prob_clipped)
+ + (1 - y_prob_clipped) * np.log(1 - y_prob_clipped)
+ )
+
+ except ValueError:
+ # If no probabilities available, assume zero uncertainty for hard predictions
+ n_samples = len(dataset.y)
+ uncertainty = np.zeros(n_samples)
+
+ # Return as a list of floats
+ return uncertainty.tolist()
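A small worked example of the binary entropy above, with illustrative probabilities only: uncertainty peaks at p = 0.5, where the entropy is ln 2 ≈ 0.693, and shrinks toward 0 as the model becomes confident in either class.

    import numpy as np

    p = np.clip(np.array([0.5, 0.9, 0.99, 0.1]), 1e-15, 1 - 1e-15)
    uncertainty = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    print(uncertainty.round(3))  # [0.693 0.325 0.056 0.325]

Note the symmetry: p = 0.9 and p = 0.1 are equally uncertain, since entropy measures confidence, not correctness.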
diff --git a/validmind/unit_metrics/classification/individual/__init__.py b/validmind/unit_metrics/classification/individual/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/validmind/vm_models/dataset/dataset.py b/validmind/vm_models/dataset/dataset.py
index fea1566d3..9e597ba19 100644
--- a/validmind/vm_models/dataset/dataset.py
+++ b/validmind/vm_models/dataset/dataset.py
@@ -8,7 +8,7 @@
import warnings
from copy import deepcopy
-from typing import Any, Dict, Optional
+from typing import Any, Dict, List, Optional, Union
import numpy as np
import pandas as pd
@@ -458,6 +458,152 @@ def probability_column(self, model: VMModel, column_name: str = None) -> str:
return self.extra_columns.probability_column(model, column_name)
+ def assign_scores(
+ self,
+ model: VMModel,
+ metrics: Union[str, List[str]],
+ **kwargs: Dict[str, Any],
+ ) -> None:
+ """Assign computed unit metric scores to the dataset as new columns.
+
+ This method computes unit metrics for the given model and dataset, then adds
+ the computed scores as new columns to the dataset using the naming convention:
+ {model.input_id}_{metric_name}
+
+ Args:
+ model (VMModel): The model used to compute the scores.
+ metrics (Union[str, List[str]]): Single metric ID or list of metric IDs.
+ Can be either:
+ - Short name (e.g., "F1", "Precision")
+ - Full metric ID (e.g., "validmind.unit_metrics.classification.F1")
+ **kwargs: Additional parameters passed to the unit metrics.
+
+ Examples:
+ # Single metric
+ dataset.assign_scores(model, "F1")
+
+ # Multiple metrics
+ dataset.assign_scores(model, ["F1", "Precision", "Recall"])
+
+ # With parameters
+ dataset.assign_scores(model, "ROC_AUC", average="weighted")
+
+ Raises:
+ ValueError: If the model input_id is None or if metric computation fails.
+ ImportError: If unit_metrics module cannot be imported.
+ """
+ if model.input_id is None:
+ raise ValueError("Model input_id must be set to use assign_scores")
+
+ # Import unit_metrics module
+ try:
+ from validmind.unit_metrics import run_metric
+ except ImportError as e:
+ raise ImportError(
+ f"Failed to import unit_metrics module: {e}. "
+ "Make sure validmind.unit_metrics is available."
+ ) from e
+
+ # Normalize metrics to a list
+ if isinstance(metrics, str):
+ metrics = [metrics]
+
+ # Process each metric
+ for metric in metrics:
+ # Normalize metric ID
+ metric_id = self._normalize_metric_id(metric)
+
+ # Extract metric name for column naming
+ metric_name = self._extract_metric_name(metric_id)
+
+ # Generate column name
+ column_name = f"{model.input_id}_{metric_name}"
+
+ try:
+ # Run the unit metric
+ result = run_metric(
+ metric_id,
+ inputs={
+ "model": model,
+ "dataset": self,
+ },
+ params=kwargs,
+ show=False, # Don't show widget output
+ )
+
+ # Extract the metric value
+ metric_value = result.metric
+
+ # Create column values (repeat the scalar value for all rows)
+ if np.isscalar(metric_value):
+ column_values = np.full(len(self._df), metric_value)
+ else:
+ if len(metric_value) != len(self._df):
+ raise ValueError(
+ f"Metric value length {len(metric_value)} does not match dataset length {len(self._df)}"
+ )
+ column_values = metric_value
+
+ # Add the column to the dataset
+ self.add_extra_column(column_name, column_values)
+
+ logger.info(f"Added metric column '{column_name}'")
+ except Exception as e:
+ logger.error(f"Failed to compute metric {metric_id}: {e}")
+ raise ValueError(f"Failed to compute metric {metric_id}: {e}") from e
+
+ def _normalize_metric_id(self, metric: str) -> str:
+ """Normalize metric identifier to full validmind unit metric ID.
+
+ Args:
+ metric (str): Metric identifier (short name or full ID)
+
+ Returns:
+ str: Full metric ID
+ """
+ # If already a full ID, return as-is
+ if metric.startswith("validmind.unit_metrics."):
+ return metric
+
+ # Try to find the metric by short name
+ try:
+ from validmind.unit_metrics import list_metrics
+
+ available_metrics = list_metrics()
+
+ # Look for exact match with short name
+ for metric_id in available_metrics:
+ if metric_id.endswith(f".{metric}"):
+ return metric_id
+
+ # If no exact match found, raise error with suggestions
+ suggestions = [m for m in available_metrics if metric.lower() in m.lower()]
+ if suggestions:
+ raise ValueError(
+ f"Metric '{metric}' not found. Did you mean one of: {suggestions[:5]}"
+ )
+ else:
+ raise ValueError(
+ f"Metric '{metric}' not found. Available metrics: {available_metrics[:10]}..."
+ )
+
+ except ImportError as e:
+ raise ImportError(
+ f"Failed to import unit_metrics for metric lookup: {e}"
+ ) from e
+
+ def _extract_metric_name(self, metric_id: str) -> str:
+ """Extract the metric name from a full metric ID.
+
+ Args:
+ metric_id (str): Full metric ID
+
+ Returns:
+ str: Metric name
+ """
+ # Extract the last part after the final dot
+ return metric_id.split(".")[-1]
+
def add_extra_column(self, column_name, column_values=None):
"""Adds an extra column to the dataset without modifying the dataset `features` and `target` columns.
diff --git a/validmind/vm_models/result/result.py b/validmind/vm_models/result/result.py
index 6edee7bbe..3016012d5 100644
--- a/validmind/vm_models/result/result.py
+++ b/validmind/vm_models/result/result.py
@@ -178,7 +178,7 @@ class TestResult(Result):
title: Optional[str] = None
doc: Optional[str] = None
description: Optional[Union[str, DescriptionFuture]] = None
- metric: Optional[Union[int, float]] = None
+ metric: Optional[Union[int, float, List[Union[int, float]]]] = None
tables: Optional[List[ResultTable]] = None
raw_data: Optional[RawData] = None
figures: Optional[List[Figure]] = None
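The widened metric field is what lets assign_scores accept both dataset-level and per-row results; below is a sketch of that branching in isolation, with invented result values for illustration:

    import numpy as np

    n_rows = 4

    for result_metric in (0.87, [0.1, 0.2, 0.6, 0.05]):  # scalar score vs. per-row values
        if np.isscalar(result_metric):
            # Broadcast a single score to every row of the dataset
            column_values = np.full(n_rows, result_metric)
        else:
            # Per-row metrics must match the dataset length before becoming a column
            assert len(result_metric) == n_rows
            column_values = np.asarray(result_metric)
        print(column_values)

Typing metric as Optional[Union[int, float, List[Union[int, float]]]] keeps existing scalar unit metrics working while allowing the new individual metrics to return one value per row.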