From 658f400d7bc02dce3d2db7b4279e1f671fbb575a Mon Sep 17 00:00:00 2001 From: Juan Date: Mon, 15 Sep 2025 10:46:37 +0200 Subject: [PATCH 1/4] Support for custom context --- .../add_context_to_llm_descriptions.ipynb | 1405 ----------------- .../custom_test_result_descriptions.ipynb | 976 ++++++++++++ validmind/ai/test_descriptions.py | 20 +- validmind/tests/run.py | 16 +- 4 files changed, 1009 insertions(+), 1408 deletions(-) delete mode 100644 notebooks/how_to/add_context_to_llm_descriptions.ipynb create mode 100644 notebooks/how_to/custom_test_result_descriptions.ipynb diff --git a/notebooks/how_to/add_context_to_llm_descriptions.ipynb b/notebooks/how_to/add_context_to_llm_descriptions.ipynb deleted file mode 100644 index e8d06f0eb..000000000 --- a/notebooks/how_to/add_context_to_llm_descriptions.ipynb +++ /dev/null @@ -1,1405 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Add context to LLM-generated test descriptions\n", - "\n", - "When you run ValidMind tests, test descriptions are automatically generated with LLM using the test results, the test name, and the static test definitions provided in the test's docstring. While this metadata offers valuable high-level overviews of tests, insights produced by the LLM-based descriptions may not always align with your specific use cases or incorporate organizational policy requirements.\n", - "\n", - "In this notebook, you'll learn how to add context to the generated descriptions by providing additional information about the test or the use case. Including custom use case context is useful when you want to highlight information about the intended use and technique of the model, or the insitution policies and standards specific to your use case." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "::: {.content-hidden when-format=\"html\"}\n", - "## Contents \n", - "- [Install the ValidMind Library](#toc1_) \n", - "- [Initialize the ValidMind Library](#toc2_) \n", - " - [Get your code snippet](#toc2_1_) \n", - "- [Initialize the Python environment](#toc3_) \n", - "- [Load the sample dataset](#toc4_) \n", - " - [Preprocess the raw dataset](#toc4_1_) \n", - "- [Initializing the ValidMind objects](#toc5_) \n", - " - [Initialize the datasets](#toc5_1_) \n", - " - [Initialize a model object](#toc5_2_) \n", - " - [Assign predictions to the datasets](#toc5_3_) \n", - "- [Set custom context for test descriptions](#toc6_) \n", - " - [Review default LLM-generated descriptions](#toc6_1_) \n", - " - [Enable use case context](#toc6_2_) \n", - " - [Disable use case context](#toc6_2_1_) \n", - " - [Add test-specific context](#toc6_3_) \n", - " - [Dataset Description](#toc6_3_1_) \n", - " - [Class Imbalance](#toc6_3_2_) \n", - " - [High Cardinality](#toc6_3_3_) \n", - " - [Missing Values](#toc6_3_4_) \n", - " - [Unique Rows](#toc6_3_5_) \n", - " - [Too Many Zero Values](#toc6_3_6_) \n", - " - [IQR Outliers Table](#toc6_3_7_) \n", - " - [Descriptive Statistics](#toc6_3_8_) \n", - " - [Pearson Correlation Matrix](#toc6_3_9_) \n", - " - [Add test-specific context using the docstring](#toc6_4_)\n", - "- [Best practices for adding custom context](#toc7_)\n", - ":::\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Install the ValidMind Library\n", - "\n", - "To install the library:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install -q validmind" - ] - }, - { - 
"cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Initialize the ValidMind Library\n", - "\n", - "ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "### Get your code snippet\n", - "\n", - "1. In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).\n", - "\n", - "2. In the left sidebar, navigate to **Model Inventory** and click **+ Register Model**.\n", - "\n", - "3. Enter the model details and click **Continue**. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/register-models-in-inventory.html))\n", - "\n", - " For example, to register a model for use with this notebook, select:\n", - "\n", - " - Documentation template: `Binary classification`\n", - " - Use case: `Marketing/Sales - Attrition/Churn Management`\n", - "\n", - " You can fill in other options according to your preference.\n", - "\n", - "4. Go to **Getting Started** and click **Copy snippet to clipboard**.\n", - "\n", - "Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load your model identifier credentials from an `.env` file\n", - "\n", - "%load_ext dotenv\n", - "%dotenv .env\n", - "\n", - "# Or replace with your code snippet\n", - "\n", - "import validmind as vm\n", - "\n", - "vm.init(\n", - " # api_host=\"...\",\n", - " # api_key=\"...\",\n", - " # api_secret=\"...\",\n", - " # model=\"...\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Initialize the Python environment\n", - "\n", - "After you've connected to your model register in the ValidMind Platform, let's import the necessary libraries and set up your Python environment for data analysis:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "import xgboost as xgb\n", - "import os\n", - "\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Load the sample dataset\n", - "\n", - "First, we'll import a sample ValidMind dataset and load it into a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), a two-dimensional tabular data structure that makes use of rows and columns:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Import the sample dataset from the library\n", - "\n", - "from validmind.datasets.classification import customer_churn\n", - "\n", - "print(\n", - " f\"Loaded demo dataset with: \\n\\n\\t• Target column: '{customer_churn.target_column}' \\n\\t• Class labels: {customer_churn.class_labels}\"\n", - ")\n", - "\n", - "raw_df = customer_churn.load_data()\n", - "raw_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Preprocess the raw dataset\n", - "\n", - "Then, we'll perform a number of operations to get ready for the subsequent steps:\n", - "\n", - "- **Preprocess the 
data:** Splits the DataFrame (`df`) into multiple datasets (`train_df`, `validation_df`, and `test_df`) using `demo_dataset.preprocess` to simplify preprocessing.\n", - "- **Separate features and targets:** Drops the target column to create feature sets (`x_train`, `x_val`) and target sets (`y_train`, `y_val`).\n", - "- **Initialize XGBoost classifier:** Creates an `XGBClassifier` object with early stopping rounds set to 10.\n", - "- **Set evaluation metrics:** Specifies metrics for model evaluation as `error`, `logloss`, and `auc`.\n", - "- **Fit the model:** Trains the model on `x_train` and `y_train` using the validation set `(x_val, y_val)`. Verbose output is disabled." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "train_df, validation_df, test_df = customer_churn.preprocess(raw_df)\n", - "\n", - "x_train = train_df.drop(customer_churn.target_column, axis=1)\n", - "y_train = train_df[customer_churn.target_column]\n", - "x_val = validation_df.drop(customer_churn.target_column, axis=1)\n", - "y_val = validation_df[customer_churn.target_column]\n", - "\n", - "model = xgb.XGBClassifier(early_stopping_rounds=10)\n", - "model.set_params(\n", - " eval_metric=[\"error\", \"logloss\", \"auc\"],\n", - ")\n", - "model.fit(\n", - " x_train,\n", - " y_train,\n", - " eval_set=[(x_val, y_val)],\n", - " verbose=False,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Initializing the ValidMind objects" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Initialize the datasets\n", - "\n", - "Before you can run tests, you'll need to initialize a ValidMind dataset object using the [`init_dataset`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) function from the ValidMind (`vm`) module.\n", - "\n", - "We'll include the following arguments:\n", - "\n", - "- **`dataset`** — the raw dataset that you want to provide as input to tests\n", - "- **`input_id`** - a unique identifier that allows tracking what inputs are used when running each individual test\n", - "- **`target_column`** — a required argument if tests require access to true values. 
This is the name of the target column in the dataset\n", - "- **`class_labels`** — an optional value to map predicted classes to class labels\n", - "\n", - "With all datasets ready, you can now initialize the raw, training, and test datasets (`raw_df`, `train_df` and `test_df`) created earlier into their own dataset objects using [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset):" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "vm_raw_dataset = vm.init_dataset(\n", - " dataset=raw_df,\n", - " input_id=\"raw_dataset\",\n", - " target_column=customer_churn.target_column,\n", - " class_labels=customer_churn.class_labels,\n", - ")\n", - "\n", - "vm_train_ds = vm.init_dataset(\n", - " dataset=train_df,\n", - " input_id=\"train_dataset\",\n", - " target_column=customer_churn.target_column,\n", - ")\n", - "\n", - "vm_test_ds = vm.init_dataset(\n", - " dataset=test_df, input_id=\"test_dataset\", target_column=customer_churn.target_column\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Initialize a model object\n", - "\n", - "Additionally, you'll need to initialize a ValidMind model object (`vm_model`) that can be passed to other functions for analysis and tests on the data. \n", - "\n", - "Simply intialize this model object with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model):" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "vm_model = vm.init_model(\n", - " model,\n", - " input_id=\"model\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Assign predictions to the datasets\n", - "\n", - "We can now use the `assign_predictions()` method from the Dataset object to link existing predictions to any model.\n", - "\n", - "If no prediction values are passed, the method will compute predictions automatically:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_train_ds.assign_predictions(\n", - " model=vm_model,\n", - ")\n", - "\n", - "vm_test_ds.assign_predictions(\n", - " model=vm_model,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Set custom context for test descriptions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Review default LLM-generated descriptions\n", - "\n", - "By default, custom context for LLM-generated descriptions is disabled, meaning that the output will not include any additional context.\n", - "\n", - "Let's generate an initial test description for the `DatasetDescription` test for comparison with later iterations:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.DatasetDescription\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Enable use case context\n", - "\n", - "To enable custom use case context, set the `VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED` environment variable to `1`.\n", - "\n", - "This is a global setting that will affect all tests for your linked model for the duration of your ValidMind Library session:" - ] - }, - { - "cell_type": "code", - 
"execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED\"] = \"1\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Enabling use case context allows you to pass in additional context, such as information about your model, relevant regulatory requirements, or model validation targets to the LLM-generated text descriptions within `use_case_context`:" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [], - "source": [ - "use_case_context = \"\"\"\n", - "\n", - "This is a customer churn prediction model for a banking loan application system using XGBoost classifier. \n", - "\n", - "Key Model Information:\n", - "- Use Case: Predict customer churn risk during loan application process\n", - "- Model Type: Binary classification using XGBoost\n", - "- Critical Decision Point: Used in loan approval workflow\n", - "\n", - "Regulatory Requirements:\n", - "- Subject to model risk management review and validation\n", - "- Results require validation review for regulatory compliance\n", - "- Model decisions directly impact loan approval process\n", - "- Does this result raise any regulatory concerns?\n", - "\n", - "Validation Focus:\n", - "- Explain strengths and weaknesses of the test and the context of whether the result is acceptable.\n", - "- What does the result indicate about model reliability?\n", - "- Is the result within acceptable thresholds for loan decisioning?\n", - "- What are the implications for customer impact?\n", - "\n", - "\"\"\".strip()\n", - "\n", - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT\"] = use_case_context" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the use case context set, generate an updated test description for the `DatasetDescription` test for comparison with default output:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.DatasetDescription\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Disable use case context\n", - "\n", - "To disable custom use case context, set the `VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED` environment variable to `0`.\n", - "\n", - "This is a global setting that will affect all tests for your linked model for the duration of your ValidMind Library session:" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED\"] = \"0\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the use case context disabled again, generate another test description for the `DatasetDescription` test for comparison with previous custom output:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.DatasetDescription\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Add test-specific context\n", - "\n", - "In addition to the model-level `use_case_context`, you're able to add test-specific context to your LLM-generated descriptions allowing you to provide 
test-specific validation criteria about the test that is being run.\n", - "\n", - "We'll reenable use case context by setting the `VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED` environment variable to `1`, then join the test-specific context to the use case context using the `VALIDMIND_LLM_DESCRIPTIONS_CONTEXT` environment variable." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [], - "source": [ - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED\"] = \"1\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Dataset Description\n", - "\n", - "Rather than relying on generic dataset result descriptions in isolation, we'll use the context to specify precise thresholds for missing values, appropriate data types for banking variables (like `CreditScore` and `Balance`), and valid value ranges based on particular business rules:" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "test_context = \"\"\"\n", - "\n", - "Acceptance Criteria:\n", - "- Missing Values: All critical features must have less than 5% missing values (including CreditScore, Balance, Age)\n", - "- Data Types: All columns must have appropriate data types (numeric for CreditScore/Balance/Age, categorical for Geography/Gender)\n", - "- Cardinality: Categorical variables must have fewer than 50 unique values, while continuous variables should show appropriate distinct value counts (e.g., high for EstimatedSalary, exactly 2 for Boolean fields)\n", - "- Value Ranges: Numeric fields must fall within business-valid ranges (CreditScore: 300-850, Age: ≥18, Balance: ≥0)\n", - "\"\"\".strip()\n", - "\n", - "context = f\"\"\"\n", - "{use_case_context}\n", - "\n", - "{test_context}\n", - "\"\"\".strip()\n", - "\n", - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT\"] = context" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the test-specific context set, generate an updated test description for the `DatasetDescription` test again:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.DatasetDescription\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Class Imbalance\n", - "\n", - "The following test-specific context example adds value to the LLM-generated description by providing defined risk levels to assess class representation:\n", - "\n", - "- By categorizing classes into `Low`, `Medium`, and `High Risk`, the LLM can generate more nuanced and actionable insights, ensuring that the analysis aligns with business requirements for balanced datasets.\n", - "- This approach not only highlights potential issues but also guides necessary documentation and mitigation strategies for high-risk classes." 
- ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [], - "source": [ - "test_context = \"\"\"\n", - "\n", - "Acceptance Criteria:\n", - "\n", - "• Risk Levels for Class Representation:\n", - " - Low Risk: Each class represents 20% or more of the total dataset\n", - " - Medium Risk: Each class represents between 10% and 19.9% of the total dataset\n", - " - High Risk: Any class represents less than 10% of the total dataset\n", - "\n", - "• Overall Requirement:\n", - " - All classes must achieve at least Medium Risk status to pass\n", - "\"\"\".strip()\n", - "\n", - "context = f\"\"\"\n", - "{use_case_context}\n", - "\n", - "{test_context}\n", - "\"\"\".strip()\n", - "\n", - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT\"] = context" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the test-specific context set, generate a test description for the `ClassImbalance` test for review:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.ClassImbalance\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - " params={\n", - " \"min_percent_threshold\": 10,\n", - " },\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### High Cardinality\n", - "\n", - "In this below case, the context specifies a risk-based criteria for the number of distinct values in categorical features.\n", - "\n", - "This helps the LLM to generate more nuanced and actionable insights, ensuring the descriptions are more relevant to your organization's policies." - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [], - "source": [ - "test_context = \"\"\"\n", - "\n", - "Acceptance Criteria:\n", - "\n", - "• Risk Levels for Distinct Values in Categorical Features:\n", - " - Low Risk: Each categorical column has fewer than 50 distinct values or less than 5% unique values relative to the total dataset size\n", - " - Medium Risk: Each categorical column has between 50 and 100 distinct values or between 5% and 10% unique values\n", - " - High Risk: Any categorical column has more than 100 distinct values or more than 10% unique values\n", - "\n", - "• Overall Requirement:\n", - " - All categorical columns must achieve at least Medium Risk status to pass\n", - "\"\"\".strip()\n", - "\n", - "context = f\"\"\"\n", - "{use_case_context}\n", - "\n", - "{test_context}\n", - "\"\"\".strip()\n", - "\n", - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT\"] = context" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the test-specific context set, generate a test description for the `HighCardinality` test for review:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.HighCardinality\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - " params= {\n", - " \"num_threshold\": 100,\n", - " \"percent_threshold\": 0.1,\n", - " \"threshold_type\": \"percent\"\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Missing Values\n", - "\n", - "Here, we use the test-specific context to establish differentiated risk thresholds across features.\n", - "\n", - "Rather than applying uniform criteria, the context allows for specific 
requirements for critical financial features (`CreditScore`, `Balance`, `Age`)." - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [], - "source": [ - "test_context = \"\"\"\n", - "Test-Specific Context for Missing Values Analysis:\n", - "\n", - "Acceptance Criteria:\n", - "\n", - "• Risk Levels for Missing Values:\n", - " - Low Risk: Less than 1% missing values in any column\n", - " - Medium Risk: Between 1% and 5% missing values\n", - " - High Risk: More than 5% missing values\n", - "\n", - "• Feature-Specific Requirements:\n", - " - Critical Features (CreditScore, Balance, Age):\n", - " * Must maintain Low Risk status\n", - " * No missing values allowed\n", - " \n", - " - Secondary Features (Tenure, NumOfProducts, EstimatedSalary):\n", - " * Must achieve at least Medium Risk status\n", - " * Up to 3% missing values acceptable\n", - "\n", - " - Categorical Features (Geography, Gender):\n", - " * Must achieve at least Medium Risk status\n", - " * Up to 5% missing values acceptable\n", - "\"\"\".strip()\n", - "\n", - "context = f\"\"\"\n", - "{use_case_context}\n", - "\n", - "{test_context}\n", - "\"\"\".strip()\n", - "\n", - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT\"] = context" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the test-specific context set, generate a test description for the `MissingValues` test for review:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.MissingValues\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - " params= {\n", - " \"min_threshold\": 1\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Unique Rows\n", - "\n", - "This example context establishes variable-specific thresholds based on business expectations.\n", - "\n", - "Rather than applying uniform criteria, it recognizes that high variability is expected in features like `EstimatedSalary` (>90%) and `Balance` (>50%), while enforcing strict limits on categorical features like `Geography` (<5 values), ensuring meaningful validation aligned with banking data characteristics." 
- ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [], - "source": [ - "test_context = \"\"\"\n", - "\n", - "Acceptance Criteria:\n", - "\n", - "• High-Variability Expected Features:\n", - " - EstimatedSalary: Must have >90% unique values\n", - " - Balance: Must have >50% unique values\n", - " - CreditScore: Must have between 5-10% unique values\n", - "\n", - "• Medium-Variability Features:\n", - " - Age: Should have between 0.5-2% unique values\n", - " - Tenure: Should have between 0.1-0.5% unique values\n", - "\n", - "• Low-Variability Features:\n", - " - Binary Features (HasCrCard, IsActiveMember, Gender, Exited): Must have exactly 2 unique values\n", - " - Geography: Must have fewer than 5 unique values\n", - " - NumOfProducts: Must have fewer than 10 unique values\n", - "\n", - "• Overall Requirements:\n", - " - Features must fall within their specified ranges to pass\n", - "\"\"\".strip()\n", - "\n", - "context = f\"\"\"\n", - "{use_case_context}\n", - "\n", - "{test_context}\n", - "\"\"\".strip()\n", - "\n", - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT\"] = context" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the test-specific context set, generate a test description for the `UniqueRows` test for review:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.UniqueRows\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - " params= {\n", - " \"min_percent_threshold\": 1\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Too Many Zero Values\n", - "\n", - "Here, test-specific context is used to provide meaning and expectations for different variables:\n", - "\n", - "- For instance, zero values in `Balance` and `Tenure` indicate risk, whereas zeros in binary variables like `HasCrCard` or `IsActiveMember` are expected.\n", - "- This tailored context ensures that the analysis accurately reflects the business significance of zero values across different features." 
- ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [], - "source": [ - "test_context = \"\"\"\n", - "\n", - "Acceptance Criteria:\n", - "- Numerical Features Only: Test evaluates only continuous numeric columns (Balance, Tenure), \n", - " excluding binary columns (HasCrCard, IsActiveMember)\n", - "\n", - "- Risk Level Thresholds for Balance and Tenure:\n", - " - High Risk: More than 5% zero values\n", - " - Medium Risk: Between 3% and 5% zero values\n", - " - Low Risk: Less than 3% zero values\n", - "\n", - "- Individual Column Requirements:\n", - " - Balance: Must be Low Risk (banking context requires accurate balance tracking)\n", - " - Tenure: Must be Low or Medium Risk (some zero values acceptable for new customers)\n", - "\n", - "• Overall Test Result: Test must achieve \"Pass\" status (Low Risk) for Balance, and at least Medium Risk for Tenure\n", - "\n", - "\"\"\".strip()\n", - "\n", - "context = f\"\"\"\n", - "{use_case_context}\n", - "\n", - "{test_context}\n", - "\"\"\".strip()\n", - "\n", - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT\"] = context" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the test-specific context set, generate a test description for the `TooManyZeroValues` test for review:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.TooManyZeroValues\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - " params= {\n", - " \"max_percent_threshold\": 0.03\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### IQR Outliers Table\n", - "\n", - "In this case, we use test-specific context to incorporate risk levels tailored to key variables, like `CreditScore`, `Age`, and `NumOfProducts`, that otherwise would not be considered for outlier analysis if we ran the test without context where all variables would be evaluated without any business criteria." 
- ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [], - "source": [ - "test_context = \"\"\"\n", - "\n", - "Acceptance Criteria:\n", - "- Risk Levels for Outliers:\n", - " - Low Risk: 0-50 outliers\n", - " - Medium Risk: 51-300 outliers\n", - " - High Risk: More than 300 outliers\n", - "- Feature-Specific Requirements:\n", - " - CreditScore, Age, NumOfProducts: Must maintain Low Risk status to ensure data quality and model reliability\n", - "\n", - "\"\"\".strip()\n", - "\n", - "context = f\"\"\"\n", - "{use_case_context}\n", - "\n", - "{test_context}\n", - "\"\"\".strip()\n", - "\n", - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT\"] = context" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the test-specific context set, generate a test description for the `IQROutliersTable` test for review:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.IQROutliersTable\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - " params= {\n", - " \"threshold\": 1.5\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Descriptive Statistics\n", - "\n", - "Test-specific context is used in this case to provide risk-based thresholds aligned with the bank's policy.\n", - "\n", - "For instance, `CreditScore` ranges of 550-850 are considered low risk based on standard credit assessment practices, while `Balance` thresholds reflect typical retail banking ranges." - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [], - "source": [ - "test_context = \"\"\"\n", - "\n", - "Acceptance Criteria:\n", - "\n", - "• CreditScore:\n", - " - Low Risk: 550-850\n", - " - Medium Risk: 450-549\n", - " - High Risk: <450 or missing\n", - " - Justification: Banking standards require reliable credit assessment\n", - "\n", - "• Age:\n", - " - Low Risk: 18-75\n", - " - Medium Risk: 76-85\n", - " - High Risk: >85 or <18\n", - " - Justification: Core banking demographic with age-appropriate products\n", - "\n", - "• Balance:\n", - " - Low Risk: 0-200,000\n", - " - Medium Risk: 200,001-250,000\n", - " - High Risk: >250,000\n", - " - Justification: Typical retail banking balance ranges\n", - "\n", - "• Tenure:\n", - " - Low Risk: 1-10 years\n", - " - Medium Risk: <1 year\n", - " - High Risk: 0 or >10 years\n", - " - Justification: Expected customer relationship duration\n", - "\n", - "• EstimatedSalary:\n", - " - Low Risk: 25,000-150,000\n", - " - Medium Risk: 150,001-200,000\n", - " - High Risk: <25,000 or >200,000\n", - " - Justification: Typical income ranges for retail banking customers\n", - "\n", - "\"\"\".strip()\n", - "\n", - "context = f\"\"\"\n", - "{use_case_context}\n", - "\n", - "{test_context}\n", - "\"\"\".strip()\n", - "\n", - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT\"] = context" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the test-specific context set, generate a test description for the `DescriptiveStatistics` test for review:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.DescriptiveStatistics\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - 
"source": [ - "\n", - "\n", - "#### Pearson Correlation Matrix\n", - "\n", - "For this test, the context provides meaningful correlation ranges between specific variable pairs based on business criteria.\n", - "\n", - "For example, while a general correlation analysis might flag any correlation above 0.7 as concerning, the test-specific context specifies that `Balance` and `NumOfProducts` should maintain a negative correlation between -0.4 and 0, reflecting expected banking relationships." - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [], - "source": [ - "test_context = \"\"\"\n", - "\n", - "Acceptance Criteria:\n", - "\n", - "• Target Variable Correlations (Exited):\n", - " - Must show correlation coefficients between ±0.1 and ±0.3 with Age, CreditScore, and Balance\n", - " - Should not exceed ±0.2 correlation with other features\n", - " - Justification: Ensures predictive power while avoiding target leakage\n", - "\n", - "• Feature Correlations:\n", - " - Balance & NumOfProducts: Must maintain correlation between -0.4 and 0\n", - " - Age & Tenure: Should show positive correlation between 0.1 and 0.3\n", - " - CreditScore & Balance: Should maintain correlation between 0.1 and 0.3\n", - "\n", - "• Binary Feature Correlations:\n", - " - HasCreditCard & IsActiveMember: Must not exceed ±0.15 correlation\n", - " - Binary features should not show strong correlations (>±0.2) with continuous features\n", - "\n", - "• Overall Requirement:\n", - " - No feature pair should exceed ±0.7 correlation to avoid multicollinearity\n", - "\n", - "\"\"\".strip()\n", - "\n", - "context = f\"\"\"\n", - "{use_case_context}\n", - "\n", - "{test_context}\n", - "\"\"\".strip()\n", - "\n", - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT\"] = context" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With the test-specific context set, generate a test description for the `PearsonCorrelationMatrix` test for review:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.data_validation.PearsonCorrelationMatrix\",\n", - " inputs={\n", - " \"dataset\": vm_raw_dataset,\n", - " },\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Add test-specific context using the docstring\n", - "\n", - "Another way to customize test result descriptions is to include explicit instructions in the test docstring:\n", - "\n", - "- Unlike the environment variable methods above which require runtime configuration and is best for dynamic customization across multiple levels (global, test suite, test-specific), modifying the docstring permanently embeds instructions within the test definition itself.\n", - "- This docstring approach is ideal for tests with consistent reporting requirements that should persist across environments, ensuring standardized outputs regardless of external configuration settings.\n", - "\n", - "In the following example, we demonstrate using post-processing functions to dynamically modify the docstring at runtime, which is useful for experimentation in notebooks. However, users can alternatively hardcode these same instructions directly in the test's docstring if they want these customizations to be permanently part of the test definition without requiring additional runtime code. 
**Use this method when you want instructions to remain an intrinsic part of the test's definition,** eliminating the need to repeatedly set environment variables in different execution contexts.\n", - "\n", - "We'll implement a custom test with a default docstring that follows the ValidMind docstring structure. First, we will run this custom test with the default description:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED\"] = \"0\"\n", - "\n", - "@vm.test(\"my_custom_tests.MissingValues\")\n", - "def MissingValues(dataset, min_threshold = 1):\n", - " \"\"\"\n", - " Evaluates dataset quality by ensuring missing value ratio across all features does not exceed a set threshold.\n", - "\n", - " ### Purpose\n", - "\n", - " The Missing Values test is designed to evaluate the quality of a dataset by measuring the number of missing values\n", - " across all features. The objective is to ensure that the ratio of missing data to total data is less than a\n", - " predefined threshold, defaulting to 1, in order to maintain the data quality necessary for reliable predictive\n", - " strength in a machine learning model.\n", - "\n", - " ### Test Mechanism\n", - "\n", - " The mechanism for this test involves iterating through each column of the dataset, counting missing values\n", - " (represented as NaNs), and calculating the percentage they represent against the total number of rows. The test\n", - " then checks if these missing value counts are less than the predefined `min_threshold`. The results are shown in a\n", - " table summarizing each column, the number of missing values, the percentage of missing values in each column, and a\n", - " Pass/Fail status based on the threshold comparison.\n", - "\n", - " ### Signs of High Risk\n", - "\n", - " - When the number of missing values in any column exceeds the `min_threshold` value.\n", - " - Presence of missing values across many columns, leading to multiple instances of failing the threshold.\n", - "\n", - " ### Strengths\n", - "\n", - " - Quick and granular identification of missing data across each feature in the dataset.\n", - " - Provides an effective and straightforward means of maintaining data quality, essential for constructing efficient\n", - " machine learning models.\n", - "\n", - " ### Limitations\n", - "\n", - " - Does not suggest the root causes of the missing values or recommend ways to impute or handle them.\n", - " - May overlook features with significant missing data but still less than the `min_threshold`, potentially\n", - " impacting the model.\n", - " - Does not account for data encoded as values like \"-999\" or \"None,\" which might not technically classify as\n", - " missing but could bear similar implications.\n", - " \"\"\"\n", - " df = dataset.df\n", - " missing = df.isna().sum()\n", - "\n", - " return (\n", - " [\n", - " {\n", - " \"Column\": col,\n", - " \"Number of Missing Values\": missing[col],\n", - " \"Percentage of Missing Values (%)\": missing[col] / df.shape[0] * 100,\n", - " \"Pass/Fail\": \"Pass\" if missing[col] < min_threshold else \"Fail\",\n", - " }\n", - " for col in missing.index\n", - " ],\n", - " all(missing[col] < min_threshold for col in missing.index),\n", - " )\n", - "\n", - "vm.tests.run_test(\n", - " \"my_custom_tests.MissingValues\",\n", - " inputs={\"dataset\": vm_raw_dataset},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, let's 
append custom instructions to the test docstring by using a post-processing function that modifies the default docstring before rerunning the test:" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [], - "source": [ - "from validmind.vm_models.result import TestResult\n", - "\n", - "# This function will append the instructions to the end of the docstring\n", - "def add_instructions(result: TestResult): \n", - " result.doc += \"\"\"\\n\\nINSTRUCTIONS: \n", - " - Generate 5 Key insights.\n", - " - Add the following note at the end of the generated output: '*NOTE: This is a sample of the data, for the full data results please look in the appendix*'\n", - " \"\"\"\n", - " return result" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You’ll notice that the description generated by the LLM is now updated to reflect the appended instructions:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"my_custom_tests.MissingValues\",\n", - " inputs={\"dataset\": vm_raw_dataset},\n", - " post_process_fn=add_instructions,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If we are happy with the customization, we can now hardcode the instructions into our docstring, so they become a permanent part of the test definition and persist across any environment where the test is executed:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "os.environ[\"VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED\"] = \"0\"\n", - "\n", - "@vm.test(\"my_custom_tests.MissingValues\")\n", - "def MissingValues(dataset, min_threshold = 1):\n", - " \"\"\"\n", - " Evaluates dataset quality by ensuring missing value ratio across all features does not exceed a set threshold.\n", - "\n", - " ### Purpose\n", - "\n", - " The Missing Values test is designed to evaluate the quality of a dataset by measuring the number of missing values\n", - " across all features. The objective is to ensure that the ratio of missing data to total data is less than a\n", - " predefined threshold, defaulting to 1, in order to maintain the data quality necessary for reliable predictive\n", - " strength in a machine learning model.\n", - "\n", - " ### Test Mechanism\n", - "\n", - " The mechanism for this test involves iterating through each column of the dataset, counting missing values\n", - " (represented as NaNs), and calculating the percentage they represent against the total number of rows. The test\n", - " then checks if these missing value counts are less than the predefined `min_threshold`. 
The results are shown in a\n", - " table summarizing each column, the number of missing values, the percentage of missing values in each column, and a\n", - " Pass/Fail status based on the threshold comparison.\n", - "\n", - " ### Signs of High Risk\n", - "\n", - " - When the number of missing values in any column exceeds the `min_threshold` value.\n", - " - Presence of missing values across many columns, leading to multiple instances of failing the threshold.\n", - "\n", - " ### Strengths\n", - "\n", - " - Quick and granular identification of missing data across each feature in the dataset.\n", - " - Provides an effective and straightforward means of maintaining data quality, essential for constructing efficient\n", - " machine learning models.\n", - "\n", - " ### Limitations\n", - "\n", - " - Does not suggest the root causes of the missing values or recommend ways to impute or handle them.\n", - " - May overlook features with significant missing data but still less than the `min_threshold`, potentially\n", - " impacting the model.\n", - " - Does not account for data encoded as values like \"-999\" or \"None,\" which might not technically classify as\n", - " missing but could bear similar implications.\n", - "\n", - " INSTRUCTIONS: \n", - " - Generate 5 Key insights.\n", - " - Add the following note at the end of the generated output: '*NOTE: This is a sample of the data, for the full data results please look in the appendix*'\n", - " \"\"\"\n", - " df = dataset.df\n", - " missing = df.isna().sum()\n", - "\n", - " return (\n", - " [\n", - " {\n", - " \"Column\": col,\n", - " \"Number of Missing Values\": missing[col],\n", - " \"Percentage of Missing Values (%)\": missing[col] / df.shape[0] * 100,\n", - " \"Pass/Fail\": \"Pass\" if missing[col] < min_threshold else \"Fail\",\n", - " }\n", - " for col in missing.index\n", - " ],\n", - " all(missing[col] < min_threshold for col in missing.index),\n", - " )\n", - "\n", - "vm.tests.run_test(\n", - " \"my_custom_tests.MissingValues\",\n", - " inputs={\"dataset\": vm_raw_dataset},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Best practices for adding custom context\n", - "\n", - "When working with test result descriptions, it's often useful to provide custom instructions that guide how a test should be interpreted. There are two main ways to add this context: through environment variables or by modifying the test's docstring. Each method has different use cases, levels of persistence, and scopes of influence. Understanding when to use each can help ensure clarity, maintainability, and consistency in your testing workflows. \n", - "\n", - "The **test’s docstring** is a natural place to embed instructions that are closely tied to the purpose and interpretation of that specific test. There are two ways to leverage docstrings:\n", - "\n", - "- Hardcoded instructions can be written directly into the test function’s docstring in the source code. These instructions become a permanent part of the test definition and will persist across all environments where the test is run. This approach is ideal when you want clear, consistent guidance that always travels with the test.\n", - "- Dynamic modification of the docstring at runtime allows you to append instructions using a post-processing function. This is useful in interactive or experimental settings—such as notebooks—where you may want to fine-tune the test’s description temporarily. 
These instructions are local to the current test and do not affect others\n", - "\n", - "While docstrings are localized to individual tests, **environment variables** offer a broader mechanism for injecting context at runtime. This method is best used when you want to apply the same guidance across multiple tests in a session. For example, you might define a high-level context once and have it apply globally throughout a suite of tests. However, because environment variables persist beyond a single test, they can unintentionally influence the behavior of subsequent tests unless explicitly overridden or cleared. This global scope is powerful, but it requires careful handling to avoid unexpected side effects.\n", - "\n", - "**Choosing the right approach**\n", - "- Use hardcoded docstring instructions when you want test guidance to be a permanent part of the test definition, ensuring consistency across environments.\n", - "- Use docstring post-processing when you need flexibility for local or temporary customization, particularly in experimental or development settings.\n", - "- Use environment variables to apply a shared context across multiple tests, but be mindful that the configuration will persist unless you reset it.\n", - "\n", - "This tiered approach provides both precision and flexibility—letting you decide whether context should live inside the test, be generated dynamically at runtime, or apply more broadly across test runs." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "ValidMind Library", - "language": "python", - "name": "validmind" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.15" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/notebooks/how_to/custom_test_result_descriptions.ipynb b/notebooks/how_to/custom_test_result_descriptions.ipynb new file mode 100644 index 000000000..c455fd0ae --- /dev/null +++ b/notebooks/how_to/custom_test_result_descriptions.ipynb @@ -0,0 +1,976 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Customize test result descriptions\n", + "\n", + "When you run ValidMind tests, test descriptions are automatically generated with LLM using the test results, the test name, and the static test definitions provided in the test's docstring. While this metadata offers valuable high-level overviews of tests, insights produced by the LLM-based descriptions may not always align with your specific use cases or incorporate organizational policy requirements.\n", + "\n", + "In this notebook, you'll learn how to take complete control over the context that drives test description generation. ValidMind provides three complementary parameters in `run_test` that give you comprehensive context management capabilities:\n", + "\n", + "\n", + "- `instructions`: Controls how the final description is structured and presented. Use this to specify formatting requirements, target different audiences (executives vs. technical teams), or ensure consistent report styles across your organization.\n", + "\n", + "- `knowledge`: Provides specific information about your organization's thresholds, business rules, and decision criteria. 
Use this to help the LLM understand what the results mean for your particular situation and how they should be interpreted.\n", + "\n", + "- `doc`: By default, this contains the technical mechanics of how the test works. However, for generic tests where the methodology isn't the focus, use this to describe what's actually being analyzed—the specific variables, features, or metrics being plotted and their business meaning rather than the statistical mechanics. You can also override ValidMind's built-in test documentation if you prefer different structure or language.\n", + "\n", + "Together, these parameters allow you to manage every aspect of the context that influences how the LLM interprets and presents your test results. Whether you need to align descriptions with regulatory requirements, target specific audiences, incorporate organizational policies, or ensure consistent reporting standards, these context management tools give you the flexibility to generate descriptions that perfectly match your needs while still leveraging the analytical power of AI-generated insights." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "::: {.content-hidden when-format=\"html\"}\n", + "## Contents \n", + "- [Setup](#toc1_) \n", + " - [Install the ValidMind Library](#toc1_1_) \n", + " - [Initialize the ValidMind Library](#toc1_2_) \n", + " - [Initialize the Python environment](#toc1_3_)\n", + "- [Model development](#toc2_)\n", + "- [Understanding test result descriptions](#toc3_)\n", + " - [Default LLM-generated descriptions](#toc3_1_)\n", + "- [Customizing results structure with instructions](#toc4_)\n", + " - [Simple instruction example](#toc4_1_)\n", + " - [Structured format instructions](#toc4_2_)\n", + " - [Template with LLM fill-ins](#toc4_3_)\n", + " - [Mixed static and dynamic content](#toc4_4_)\n", + "- [Contextualizing results with knowledge](#toc5_)\n", + " - [Understanding the knowledge parameter](#toc5_1_)\n", + " - [Basic knowledge usage](#toc5_2_)\n", + " - [Combining instructions and knowledge](#toc5_3_)\n", + "- [Overriding test documentation with doc parameter](#toc6_)\n", + " - [Structure of ValidMind built-in test docstrings](#toc6_1_)\n", + " - [Understanding the doc parameter](#toc6_2_)\n", + " - [Basic doc parameter usage](#toc6_3_)\n", + " - [Combining doc with instructions and knowledge](#toc6_4_)\n", + "- [Best practices for managing context](#toc7_)\n", + ":::\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Setup\n", + "\n", + "This section covers the basic setup required to run the examples in this notebook. We'll install ValidMind, connect to the platform, and create a customer churn model that we'll use to demonstrate the instructions and knowledge parameters throughout the examples." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Install the ValidMind Library\n", + "\n", + "To install the library:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -q validmind" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Initialize the ValidMind Library\n", + "\n", + "ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. 
You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.\n", + "\n", + "#### Get your code snippet\n", + "\n", + "1. In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).\n", + "\n", + "2. In the left sidebar, navigate to **Model Inventory** and click **+ Register Model**.\n", + "\n", + "3. Enter the model details and click **Continue**. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/register-models-in-inventory.html))\n", + "\n", + "    For example, to register a model for use with this notebook, select:\n", + "\n", + "    - Documentation template: `Binary classification`\n", + "    - Use case: `Marketing/Sales - Attrition/Churn Management`\n", + "\n", + "    You can fill in other options according to your preference.\n", + "\n", + "4. Go to **Getting Started** and click **Copy snippet to clipboard**.\n", + "\n", + "Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load your model identifier credentials from an `.env` file\n", + "\n", + "%load_ext dotenv\n", + "%dotenv .env\n", + "\n", + "# Or replace with your code snippet\n", + "\n", + "import validmind as vm\n", + "\n", + "vm.init(\n", + "    # api_host=\"...\",\n", + "    # api_key=\"...\",\n", + "    # api_secret=\"...\",\n", + "    # model=\"...\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Initialize the Python environment\n", + "\n", + "After you've connected to your registered model in the ValidMind Platform, let's import the necessary libraries and set up your Python environment for data analysis:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import xgboost as xgb\n", + "import os\n", + "\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Model development\n", + "\n", + "Now we'll build the customer churn model using XGBoost and ValidMind's sample dataset. This trained model will generate the test results we'll use to demonstrate the instructions and knowledge parameters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load data\n", + "\n", + "First, we'll import a sample ValidMind dataset and load it into a pandas dataframe:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import the sample dataset from the library\n", + "\n", + "from validmind.datasets.classification import customer_churn\n", + "\n", + "print(\n", + "    f\"Loaded demo dataset with: \\n\\n\\t• Target column: '{customer_churn.target_column}' \\n\\t• Class labels: {customer_churn.class_labels}\"\n", + ")\n", + "\n", + "raw_df = customer_churn.load_data()\n", + "raw_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Fit the model\n", + "\n", + "Then, we prepare the data and model by first splitting the DataFrame into training, validation, and test sets, then separating features from targets. 
An XGBoost classifier is initialized with early stopping, evaluation metrics (error, logloss, and auc) are defined, and the model is trained on the training data with validation monitoring." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "train_df, validation_df, test_df = customer_churn.preprocess(raw_df)\n", + "\n", + "x_train = train_df.drop(customer_churn.target_column, axis=1)\n", + "y_train = train_df[customer_churn.target_column]\n", + "x_val = validation_df.drop(customer_churn.target_column, axis=1)\n", + "y_val = validation_df[customer_churn.target_column]\n", + "\n", + "model = xgb.XGBClassifier(early_stopping_rounds=10)\n", + "model.set_params(\n", + " eval_metric=[\"error\", \"logloss\", \"auc\"],\n", + ")\n", + "model.fit(\n", + " x_train,\n", + " y_train,\n", + " eval_set=[(x_val, y_val)],\n", + " verbose=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Initialize the ValidMind objects\n", + "\n", + "Before you can run tests, you'll need to initialize a ValidMind dataset object using the [`init_dataset`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) function from the ValidMind (`vm`) module.\n", + "\n", + "We'll include the following arguments:\n", + "\n", + "- **`dataset`** — the raw dataset that you want to provide as input to tests\n", + "- **`input_id`** - a unique identifier that allows tracking what inputs are used when running each individual test\n", + "- **`target_column`** — a required argument if tests require access to true values. This is the name of the target column in the dataset\n", + "- **`class_labels`** — an optional value to map predicted classes to class labels\n", + "\n", + "With all datasets ready, you can now initialize the raw, training, and test datasets (`raw_df`, `train_df` and `test_df`) created earlier into their own dataset objects using [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset):" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "vm_raw_dataset = vm.init_dataset(\n", + " dataset=raw_df,\n", + " input_id=\"raw_dataset\",\n", + " target_column=customer_churn.target_column,\n", + " class_labels=customer_churn.class_labels,\n", + ")\n", + "\n", + "vm_train_ds = vm.init_dataset(\n", + " dataset=train_df,\n", + " input_id=\"train_dataset\",\n", + " target_column=customer_churn.target_column,\n", + ")\n", + "\n", + "vm_test_ds = vm.init_dataset(\n", + " dataset=test_df, input_id=\"test_dataset\", target_column=customer_churn.target_column\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Additionally, you'll need to initialize a ValidMind model object (`vm_model`) that can be passed to other functions for analysis and tests on the data. 
\n", +    "\n", +    "Simply initialize this model object with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model):" +   ] +  }, +  { +   "cell_type": "code", +   "execution_count": 7, +   "metadata": {}, +   "outputs": [], +   "source": [ +    "vm_model = vm.init_model(\n", +    "    model,\n", +    "    input_id=\"model\",\n", +    ")" +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "We can now use the `assign_predictions()` method from the Dataset object to link existing predictions to any model.\n", +    "\n", +    "If no prediction values are passed, the method will compute predictions automatically:" +   ] +  }, +  { +   "cell_type": "code", +   "execution_count": null, +   "metadata": {}, +   "outputs": [], +   "source": [ +    "vm_train_ds.assign_predictions(\n", +    "    model=vm_model,\n", +    ")\n", +    "\n", +    "vm_test_ds.assign_predictions(\n", +    "    model=vm_model,\n", +    ")" +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "\n", +    "\n", +    "## Understanding test result descriptions\n", +    "\n", +    "Before diving into custom instructions, let's understand how ValidMind generates test descriptions by default." +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "\n", +    "\n", +    "### Default LLM-generated descriptions\n", +    "\n", +    "When you run a test without custom instructions, ValidMind's LLM analyzes:\n", +    "- The test results (tables, figures)\n", +    "- The test's built-in documentation (docstring)" +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "When ValidMind generates test descriptions automatically (without custom instructions), the LLM follows a series of standardized sections designed to provide comprehensive, objective analysis of test results:\n", +    "\n", +    "- **Test purpose:**\n", +    "This section opens with a clear explanation of what the test does and why it exists. It draws from the test’s documentation and presents the purpose in accessible, straightforward language.\n", +    "\n", +    "- **Test mechanism:**\n", +    "Here the description outlines how the test works, including its methodology, what it measures, and how those measurements are derived. For statistical tests, it also explains the meaning of each metric, how values are typically interpreted, and what ranges are expected.\n", +    "\n", +    "- **Test strengths:**\n", +    "This part highlights the value of the test by pointing out its key strengths and the scenarios where it is most useful. It also notes the kinds of insights it can provide that other tests may not capture.\n", +    "\n", +    "- **Test limitations:**\n", +    "Limitations focus on both technical constraints and interpretation challenges. The text notes when results should be treated with caution and highlights specific risk indicators tied to the test type.\n", +    "\n", +    "- **Results interpretation:**\n", +    "The results section explains how to read the outputs, whether tables or figures, and clarifies what each column, axis, or metric means. It also points out key data points, units of measurement, and any notable observations that help frame interpretation.\n", +    "\n", +    "- **Key insights:**\n", +    "Insights are listed in bullet points, moving from broad to specific. Each one has a clear title, includes relevant numbers or ranges, and ensures that all important aspects of the results are addressed.\n", +    "\n", +    "- **Conclusions:**\n", +    "The conclusion ties the insights together into a coherent narrative. 
It synthesizes the findings into objective technical takeaways and emphasizes what the results reveal about the model or data.\n", + "\n", + "\n", + "Let's see a default description:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vm.tests.run_test(\n", + " \"validmind.model_validation.sklearn.ClassifierPerformance\",\n", + " inputs={\n", + " \"dataset\": vm_test_ds,\n", + " \"model\": vm_model,\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Customizing results structure with instructions\n", + "\n", + "While the default descriptions are designed to be comprehensive, there are many cases where you might want to tailor them for your specific context. Customizing test results allows you to shape descriptions to fit your organization’s standards and practical needs. This can involve adjusting report formats, applying specific risk rating scales, adding mandatory disclaimer text, or emphasizing particular metrics.\n", + "\n", + "The `instructions` parameter is what enables this flexibility by adapting the generated descriptions to different audiences and test types. Executives often need concise summaries that emphasize overall risk, data scientists look for detailed explanations of the methodology behind tests, and compliance teams require precise language that aligns with regulatory expectations. Different test types also demand different emphases: performance metrics may benefit from technical breakdowns, while validation checks might require risk-focused narratives." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Simple instruction example\n", + "\n", + "Let's start with simple examples of the `instructions` parameter. Here's how to provide basic guidance to the LLM-generated descriptions:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "simple_instructions = \"\"\"\n", + "Please focus on business impact and provide a concise summary. 
\n", + "Include specific actionable recommendations.\n", + "\"\"\"\n", + "\n", + "vm.tests.run_test(\n", + " \"validmind.model_validation.sklearn.ClassifierPerformance\",\n", + " inputs={\n", + " \"dataset\": vm_test_ds,\n", + " \"model\": vm_model,\n", + " },\n", + " instructions=simple_instructions,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Structured format instructions\n", + "\n", + "You can request specific formatting and structure:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "structured_instructions = \"\"\"\n", + "Please structure your analysis using the following format:\n", + "\n", + "### Executive Summary\n", + "- One sentence overview of the test results\n", + "\n", + "### Key Findings\n", + "- Bullet points with the most important insights\n", + "- Include specific percentages and thresholds\n", + "\n", + "### Risk Assessment\n", + "- Classify risk level as Low/Medium/High\n", + "- Explain reasoning for the risk classification\n", + "\n", + "### Recommendations\n", + "- Specific actionable next steps\n", + "- Priority level for each recommendation\n", + "\n", + "\"\"\"\n", + "\n", + "vm.tests.run_test(\n", + " \"validmind.model_validation.sklearn.ClassifierPerformance\",\n", + " inputs={\n", + " \"dataset\": vm_test_ds,\n", + " \"model\": vm_model,\n", + " },\n", + " instructions=structured_instructions,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Template with LLM fill-ins\n", + "\n", + "One of the most powerful features is combining hardcoded text with LLM-generated content using placeholders. This allows you to ensure specific information is always included while still getting intelligent analysis of the results.\n", + "\n", + "Create a template where specific sections are filled by the LLM:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "template_instructions = \"\"\"\n", + "Please generate the description using this exact template. 
\n", +    "Fill in the [PLACEHOLDER] sections with your analysis:\n", +    "\n", +    "---\n", +    "**VALIDATION REPORT: MODEL PERFORMANCE ASSESSMENT**\n", +    "\n", +    "**Dataset ID:** test_dataset\n", +    "**Validation Type:** Classifier Performance Analysis\n", +    "**Reviewer:** ValidMind AI Analysis\n", +    "\n", +    "**EXECUTIVE SUMMARY:**\n", +    "[PROVIDE_2_SENTENCE_SUMMARY_OF_RESULTS]\n", +    "\n", +    "**KEY METRICS:**\n", +    "[ANALYZE_AND_LIST_TOP_3_MOST_IMPORTANT_FINDINGS_WITH_VALUES]\n", +    "\n", +    "**MODEL PERFORMANCE ASSESSMENT:**\n", +    "[DETAILED_ANALYSIS_OF_CLASSIFICATION_METRICS_AND_IMPACT]\n", +    "\n", +    "**RISK RATING:** [ASSIGN_LOW_MEDIUM_HIGH_RISK_WITH_JUSTIFICATION]\n", +    "\n", +    "**RECOMMENDATIONS:**\n", +    "[PROVIDE_SPECIFIC_ACTIONABLE_RECOMMENDATIONS_NUMBERED_LIST]\n", +    "\n", +    "**VALIDATION STATUS:** [PASS_CONDITIONAL_PASS_OR_FAIL_WITH_REASONING]\n", +    "\n", +    "---\n", +    "*This report was generated using ValidMind's automated validation platform.*\n", +    "*For questions about this analysis, contact the Data Science team.*\n", +    "---\n", +    "\n", +    "Important: Use the exact template structure above and fill in each [PLACEHOLDER] section.\n", +    "\"\"\"\n", +    "\n", +    "vm.tests.run_test(\n", +    "    \"validmind.model_validation.sklearn.ClassifierPerformance\",\n", +    "    inputs={\n", +    "        \"dataset\": vm_test_ds,\n", +    "        \"model\": vm_model,\n", +    "    },\n", +    "    instructions=template_instructions\n", +    ")" +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "\n", +    "\n", +    "### Mixed static and dynamic content\n", +    "\n", +    "Combine mandatory text with intelligent analysis:" +   ] +  }, +  { +   "cell_type": "code", +   "execution_count": null, +   "metadata": {}, +   "outputs": [], +   "source": [ +    "# Mixed static and dynamic content\n", +    "mixed_content_instructions = \"\"\"\n", +    "Return ONLY the assembled content in plain Markdown paragraphs and lists.\n", +    "Do NOT include any headings or titles (no lines starting with '#'), labels,\n", +    "XML-like tags, variable names, or code fences.\n", +    "Do NOT repeat or paraphrase these instructions. Start the first line with the\n", +    "first mandatory sentence below—no preface.\n", +    "\n", +    "You MUST include all MANDATORY blocks verbatim (exact characters, spacing, and punctuation).\n", +    "You MUST replace PLACEHOLDER blocks with the requested content.\n", +    "Between blocks, include exactly ONE blank line.\n", +    "\n", +    "MANDATORY BLOCK A (include verbatim):\n", +    "This data validation assessment was conducted in accordance with the \n", +    "XYZ Bank Model Risk Management Policy (Document ID: MRM-2024-001). \n", +    "All findings must be reviewed by the Model Validation Team before \n", +    "model deployment.\n", +    "\n", +    "PLACEHOLDER BLOCK B (replace with prose paragraphs; no headings):\n", +    "[Provide detailed analysis of the test results, including specific values, \n", +    "interpretations, and implications for model quality. Focus on data quality \n", +    "aspects and potential issues that could affect model performance.]\n", +    "\n", +    "MANDATORY BLOCK C (include verbatim):\n", +    "IMPORTANT: This automated analysis is supplementary to human expert review. \n", +    "All high-risk findings require immediate escalation to the Chief Risk Officer. 
\n", +    "Model deployment is prohibited until all Medium and High risk items are resolved.\n", +    "\n", +    "PLACEHOLDER BLOCK D (replace with a numbered list only):\n", +    "[Create a numbered list of specific action items with responsible parties \n", +    "and suggested timelines for resolution.]\n", +    "\n", +    "MANDATORY BLOCK E (include verbatim):\n", +    "Validation performed using ValidMind Platform v2.0 | \n", +    "Next review required: [30 days from test date] | \n", +    "Contact: model-risk@xyzbank.com\n", +    "\n", +    "Compliance checks BEFORE you finalize your answer:\n", +    "- No headings or titles present (no '#' anywhere).\n", +    "- No XML-like tags or labels (e.g., \"BLOCK A\") in the output.\n", +    "- All three MANDATORY blocks included exactly as written.\n", +    "- PLACEHOLDER B replaced with prose; PLACEHOLDER D replaced with a numbered list.\n", +    "- Exactly one blank line between each block.\n", +    "\"\"\"\n", +    "\n", +    "\n", +    "vm.tests.run_test(\n", +    "    \"validmind.model_validation.sklearn.ClassifierPerformance\",\n", +    "    inputs={\n", +    "        \"dataset\": vm_test_ds,\n", +    "        \"model\": vm_model,\n", +    "    },\n", +    "    instructions=mixed_content_instructions,\n", +    ")" +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "\n", +    "\n", +    "## Contextualizing results with knowledge\n", +    "\n", +    "While the `instructions` parameter controls *how* your test descriptions are formatted and structured, the new `knowledge` parameter provides *context* about what the results mean for your specific business situation. Think of `instructions` as the \"format template\" and `knowledge` as the \"business context\" that helps the LLM understand what matters most in your organization.\n" +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "\n", +    "\n", +    "### Understanding the knowledge parameter\n", +    "\n", +    "The `knowledge` parameter can be used to add any background information that helps put the test results into context. For example, you might include business priorities and constraints that shape how results are interpreted, risk tolerance levels or acceptance criteria specific to your organization, regulatory requirements that influence what counts as acceptable performance, or details about the intended use case of the model in production. These are just examples—the parameter is flexible and can capture whatever context is most relevant to your needs.\n", +    "\n", +    "**Key difference:**\n", +    "- `instructions`: \"Write a 3-paragraph executive summary\"\n", +    "\n", +    "- `knowledge`: \"If Accuracy is above 0.85 but Class 1 Recall falls below 0.60, the model should be considered high risk\"\n", +    "\n", +    "When used together, these parameters create descriptions that don’t just report the Recall or Accuracy measures for Class 1, but explain that because Accuracy is above 0.85 while Recall falls below 0.60, the model should be treated as high risk for your business."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Basic knowledge usage\n", + "\n", + "Here's how business context transforms the interpretation of our classifier results:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "simple_knowledge = \"\"\"\n", + "MODEL CONTEXT:\n", + "- Class 0 = Customer stays (retains banking relationship)\n", + "- Class 1 = Customer churns (closes accounts, leaves bank)\n", + "\n", + "DECISION RULES:\n", + "- ROC AUC >0.85: APPROVE deployment\n", + "- ROC AUC <0.85: REJECT model\n", + "\n", + "CHURN DETECTION RULES:\n", + "- Recall >50% for churning customers: Good - use high-touch retention \n", + "- Recall <50% for churning customers: Poor - retention program will fail\n", + "\"\"\"\n", + "\n", + "vm.tests.run_test(\n", + " \"validmind.model_validation.sklearn.ClassifierPerformance\",\n", + " inputs={\n", + " \"dataset\": vm_test_ds,\n", + " \"model\": vm_model,\n", + " },\n", + " knowledge=simple_knowledge,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Combining instructions and knowledge\n", + "\n", + "Here's how combining both parameters creates targeted analysis of our churn model performance:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Executive decision instructions\n", + "executive_instructions = \"\"\"\n", + "Create a GO/NO-GO decision memo:\n", + "\n", + "**THRESHOLD ANALYSIS:** [Pass/Fail against specific thresholds]\n", + "**BUSINESS IMPACT:** [Dollar impact of current performance] \n", + "**DEPLOYMENT DECISION:** [APPROVE/CONDITIONAL/REJECT]\n", + "**REQUIRED ACTIONS:** [Specific next steps with timelines]\n", + "\n", + "Be definitive - use the thresholds to make clear recommendations.\n", + "\"\"\"\n", + "\n", + "# Retail banking with hard thresholds\n", + "retail_thresholds = \"\"\"\n", + "RETAIL BANKING CONTEXT:\n", + "- Class 0 = Customer retention (keeps checking/savings accounts)\n", + "- Class 1 = Customer churn (closes accounts, switches banks)\n", + "\n", + "REGULATORY THRESHOLDS:\n", + "- AUC >0.80: Meets regulatory model standards (OUR 0.854: PASS)\n", + "- Churn Recall >55%: Adequate churn detection (OUR 47%: FAIL)\n", + "- Churn Precision >65%: Cost-effective targeting (OUR 73%: PASS)\n", + "\n", + "DEPLOYMENT MATRIX:\n", + "- All 3 Pass: FULL DEPLOYMENT\n", + "- 2 Pass: CONDITIONAL DEPLOYMENT\n", + "- <2 Pass: REJECT MODEL\n", + "\n", + "\"\"\"\n", + "\n", + "vm.tests.run_test(\n", + " \"validmind.model_validation.sklearn.ClassifierPerformance\",\n", + " inputs={\n", + " \"dataset\": vm_test_ds,\n", + " \"model\": vm_model,\n", + " },\n", + " instructions=executive_instructions,\n", + " knowledge=retail_thresholds,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Overriding test documentation with doc parameter\n", + "\n", + "Each test, whether built-in or customized, includes a built-in docstring that serves as its default documentation. This docstring usually explains what the test does and what it outputs. In many cases, especially for specialized tests with well-defined purposes—the default docstring is already useful and sufficient." 
+   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "\n", +    "\n", +    "### Structure of ValidMind built-in test docstrings\n", +    "\n", +    "Every ValidMind built-in test includes a docstring that serves as its default documentation. This docstring follows a consistent structure so that both users and the LLM can rely on a predictable format. While the content varies depending on the type of test—for example, highly specific tests like SHAP values or PSI provide technical detail, whereas generic tests like descriptive statistics or histograms are more general—the overall layout remains the same.\n", +    "\n", +    "A typical docstring contains the following sections:\n", +    "\n", +    "- **Overview:**\n", +    "A short description of what the test does and what kind of output it generates.\n", +    "\n", +    "- **Purpose:**\n", +    "Explains why the test exists and what it is designed to evaluate. This section provides the context for the test’s role in model documentation, often describing the intended use cases or the kind of insights it supports.\n", +    "\n", +    "- **Test mechanism:**\n", +    "Describes how the test works internally. This includes the approach or methodology, what inputs are used, how results are calculated or visualized, and the logic behind the test’s implementation.\n", +    "\n", +    "- **Signs of high risk:**\n", +    "Outlines risk indicators that are specific to the test. These highlight situations where results should be interpreted with caution—for example, imbalances in distributions or errors in processing steps.\n", +    "\n", +    "- **Strengths:**\n", +    "Highlights the capabilities and benefits of the test, explaining what makes it particularly useful and what kinds of insights it provides that may not be captured elsewhere.\n", +    "\n", +    "- **Limitations:**\n", +    "Discusses the constraints of the test, including technical shortcomings, interpretive challenges, and situations where the results might be misleading or incomplete.\n", +    "\n", +    "This structure ensures that all built-in tests provide a comprehensive explanation of their purpose, mechanics, strengths, and limitations. For more generic tests, the docstring may read as boilerplate information about the test’s mechanics. In these cases, the `doc` parameter can be used to override the docstring with context that is more relevant to the dataset, feature, or business use case under analysis." +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "\n", +    "\n", +    "### Understanding the doc parameter\n", +    "\n", +    "Overriding the docstring with the `doc` parameter is particularly valuable for more generic tests, where the default text often focuses on the mechanics of producing an output rather than the data or variable being analyzed. For example, instead of including documentation about the methodology used to compute a histogram, you may want to document the business meaning of the feature being visualized, its expected distribution, or what to pay attention to. Similarly, when generating a descriptive statistics table, you may prefer documentation that describes the dataset under review. 
\n", +    "\n", +    "Customizing the `doc` parameter allows you to shift the focus of the explanation from the test machinery to the aspects of the data that matter most for your audience, while still relying on the built-in docstring for cases where the default detail is already fit for purpose.\n", +    "\n", +    "**When to override**\n", +    "\n", +    "For tests like histograms or descriptive statistics where the statistical methodology is standard and uninteresting, replace the generic documentation with meaningful descriptions of the variables being analyzed. Also use this to customize ValidMind's built-in test documentation when you want different terminology, structure, or emphasis than what's provided by default." +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "\n", +    "\n", +    "### Basic doc parameter usage" +   ] +  }, +  { +   "cell_type": "code", +   "execution_count": null, +   "metadata": {}, +   "outputs": [], +   "source": [ +    "custom_doc = \"\"\"\n", +    "This test evaluates customer churn prediction model performance specifically \n", +    "for retail banking applications. The analysis focuses on classification \n", +    "metrics relevant to customer retention programs and regulatory compliance \n", +    "requirements under our internal Model Risk Management framework.\n", +    "\n", +    "Key metrics analyzed:\n", +    "- Precision: Accuracy of churn predictions to minimize wasted retention costs\n", +    "- Recall: Coverage of actual churners to maximize retention program effectiveness \n", +    "- F1-Score: Balanced measure considering both precision and recall\n", +    "- ROC AUC: Overall discriminatory power for regulatory model approval\n", +    "\n", +    "Results inform deployment decisions for automated retention campaigns.\n", +    "\"\"\"\n", +    "\n", +    "result = vm.tests.run_test(\n", +    "    \"validmind.model_validation.sklearn.ClassifierPerformance\",\n", +    "    inputs={\n", +    "        \"model\": vm_model,\n", +    "        \"dataset\": vm_test_ds\n", +    "    },\n", +    "    doc=custom_doc\n", +    ")" +   ] +  }, +  { +   "cell_type": "markdown", +   "metadata": {}, +   "source": [ +    "\n", +    "\n", +    "### Combining doc with instructions and knowledge" +   ] +  }, +  { +   "cell_type": "code", +   "execution_count": null, +   "metadata": {}, +   "outputs": [], +   "source": [ +    "# All three parameters working together\n", +    "banking_doc = \"\"\"\n", +    "Customer Churn Risk Assessment Test for Retail Banking.\n", +    "Evaluates model's ability to identify customers likely to close accounts \n", +    "and switch to competitor banks within 12 months.\n", +    "- Class 0 = Customer retention (maintains banking relationship)\n", +    "- Class 1 = Customer churn (closes primary accounts)\n", +    "\"\"\"\n", +    "\n", +    "executive_instructions = \"\"\"\n", +    "Format as a risk committee briefing:\n", +    "**TEST DESCRIPTION:** [Test description]\n", +    "**RISK ASSESSMENT:** [Model risk level]\n", +    "**REGULATORY STATUS:** [Compliance with banking regulations]\n", +    "**BUSINESS RECOMMENDATION:** [Deploy/Hold/Reject with rationale]\n", +    "\"\"\"\n", +    "\n", +    "banking_knowledge = \"\"\"\n", +    "REGULATORY CONTEXT:\n", +    "- OCC guidance requires AUC >0.80 for model approval\n", +    "- Our threshold: Churn recall >50% for retention program viability\n", +    "\"\"\"\n", +    "\n", +    "result = vm.tests.run_test(\n", +    "    \"validmind.model_validation.sklearn.ClassifierPerformance\",\n", +    "    inputs={\n", +    "        \"model\": vm_model,\n", +    "        \"dataset\": vm_test_ds\n", +    "    },\n", +    "    doc=banking_doc,\n", +    "    instructions=executive_instructions,\n", +    "    knowledge=banking_knowledge\n", +    ")" +   ] +  }, +  { +   "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Best practices for managing context\n", + "\n", + "When using `instructions`, `knowledge`, and `doc` parameters together, follow these guidelines to create effective, consistent, and maintainable test descriptions.\n", + "\n", + "**Choose the right parameter for each need:**\n", + "\n", + "- Use `doc` for technical corrections when you need to fix or clarify test methodology, override ValidMind's built-in documentation with your preferred structure or terminology, replace generic test mechanics with meaningful descriptions of variables and features being analyzed, or provide domain-specific context for regulatory compliance. \n", + "\n", + "- Apply `knowledge` for business rules such as performance thresholds and decision criteria, specific business context like customer economics and operational constraints, aggressive threshold-driven decision logic that automatically determines deployment recommendations, and industry-specific requirements like regulatory frameworks or risk tolerances.\n", + "\n", + "- Leverage `instructions` for audience targeting to control format and presentation style, create structured templates with specific sections and placeholders for LLM fill-ins, combine hardcoded mandatory text with dynamic analysis, and ensure consistent organizational reporting standards across different stakeholder groups.\n", + "\n", + "**Avoid redundancy:**\n", + "\n", + "Don't repeat the same information across multiple parameters, as each parameter should add unique value to the description generation. If content overlaps, choose the most appropriate parameter for that information to maintain clarity and prevent conflicting or duplicate guidance in your test descriptions.\n", + "\n", + "**Increasing consistency and grounding:**\n", + "\n", + "Since LLMs can produce variable responses, use hardcoded sections in your instructions for content that requires no variability, combined with specific placeholders for data you trust the LLM to generate. For example, include mandatory disclaimers, policy references, and fixed formatting exactly as written, while using placeholders like `[ANALYZE_PERFORMANCE_METRICS]` for dynamic content. This approach ensures critical information appears consistently while still leveraging the LLM's analytical capabilities.\n", + "\n", + "Use `doc` and `knowledge` parameters to anchor test descriptions in your specific domain and business context, preventing the LLM from generating generic or inappropriate interpretations. 
Then use `instructions` to explicitly direct the LLM to ground its analysis in this provided context, such as \"Base all recommendations on the thresholds specified in the knowledge section\" or \"Interpret all metrics according to the test description provided.\"\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/validmind/ai/test_descriptions.py b/validmind/ai/test_descriptions.py index a62f6b7bf..8c8b16e59 100644 --- a/validmind/ai/test_descriptions.py +++ b/validmind/ai/test_descriptions.py @@ -74,6 +74,8 @@ def generate_description( metric: Union[float, int] = None, figures: List[Figure] = None, title: Optional[str] = None, + instructions: Optional[str] = None, + knowledge: Optional[str] = None, ): """Generate the description for the test results.""" from validmind.api_client import generate_test_result_description @@ -122,7 +124,8 @@ def generate_description( "figures": [ figure._get_b64_url() for figure in ([] if tables else figures) ], - "context": _get_llm_global_context(), + "knowledge": knowledge, + "instructions": instructions, } )["content"] @@ -134,6 +137,8 @@ def background_generate_description( figures: List[Figure] = None, metric: Union[int, float] = None, title: Optional[str] = None, + instructions: Optional[str] = None, + knowledge: Optional[str] = None, ): def wrapped(): try: @@ -145,6 +150,8 @@ def wrapped(): figures=figures, metric=metric, title=title, + instructions=instructions, + knowledge=knowledge, ), True, ) @@ -174,6 +181,8 @@ def get_result_description( metric: Union[int, float] = None, should_generate: bool = True, title: Optional[str] = None, + instructions: Optional[str] = None, + knowledge: Optional[str] = None, ): """Get the metadata dictionary for a test or metric result. @@ -195,10 +204,17 @@ def get_result_description( figures (List[Figure]): The figures to attach to the test suite result. metric (Union[int, float]): Unit metrics attached to the test result. should_generate (bool): Whether to generate the description or not. Defaults to True. + instructions (Optional[str]): Instructions for the LLM to generate the description. + knowledge (Optional[str]): Knowledge base for the LLM to generate the description. Returns: str: The description to be logged with the test results. 
""" + # Backwards compatibility: parameter instructions override environment variable + env_instructions = _get_llm_global_context() + # Parameter instructions take precedence and override environment variable + _instructions = instructions if instructions is not None else env_instructions + # Check the feature flag first, then the environment variable llm_descriptions_enabled = ( client_config.can_generate_llm_test_descriptions() @@ -216,6 +232,8 @@ def get_result_description( figures=figures, metric=metric, title=title, + instructions=_instructions, + knowledge=knowledge, ) else: diff --git a/validmind/tests/run.py b/validmind/tests/run.py index 2a32a3a81..cca3d8db5 100644 --- a/validmind/tests/run.py +++ b/validmind/tests/run.py @@ -274,6 +274,7 @@ def _run_test( inputs: Dict[str, Any], params: Dict[str, Any], title: Optional[str] = None, + doc: Optional[str] = None, ): """Run a standard test and return a TestResult object""" test_func = load_test(test_id) @@ -285,10 +286,13 @@ def _run_test( raw_result = test_func(**input_kwargs, **param_kwargs) + # Use custom doc if provided, otherwise use the test function's docstring + _doc = doc if doc is not None else getdoc(test_func) + return build_test_result( outputs=raw_result, test_id=test_id, - test_doc=getdoc(test_func), + test_doc=_doc, inputs=input_kwargs, params=param_kwargs, title=title, @@ -308,6 +312,9 @@ def run_test( # noqa: C901 title: Optional[str] = None, post_process_fn: Union[Callable[[TestResult], None], None] = None, show_params: bool = True, + instructions: Optional[str] = None, + knowledge: Optional[str] = None, + doc: Optional[str] = None, **kwargs, ) -> TestResult: """Run a ValidMind or custom test @@ -333,6 +340,9 @@ def run_test( # noqa: C901 title (str, optional): Custom title for the test result post_process_fn (Callable[[TestResult], None], optional): Function to post-process the test result show_params (bool, optional): Whether to include parameter values in figure titles for comparison tests. Defaults to True. + instructions (str, optional): Instructions for the LLM to generate a description. Defaults to None. + knowledge (str, optional): Knowledge base for the LLM to generate the description. Defaults to None. + doc (str, optional): Custom docstring to override the test's built-in documentation. Defaults to None. 
Returns: TestResult: A TestResult object containing the test results @@ -388,7 +398,7 @@ def run_test( # noqa: C901 ) else: - result = _run_test(test_id, inputs, params, title) + result = _run_test(test_id, inputs, params, title, doc) end_time = time.perf_counter() result.metadata = _get_run_metadata(duration_seconds=end_time - start_time) @@ -405,6 +415,8 @@ def run_test( # noqa: C901 metric=result.metric, should_generate=generate_description, title=title, + instructions=instructions, + knowledge=knowledge, ) if show: From 59308ae72dd34bb3b8c4612baa2a9282135e3c2d Mon Sep 17 00:00:00 2001 From: Juan Date: Thu, 18 Sep 2025 12:12:26 +0200 Subject: [PATCH 2/4] Passing context parameters via dictionary --- .../custom_test_result_descriptions.ipynb | 144 +++++++++--------- validmind/ai/test_descriptions.py | 14 +- validmind/tests/run.py | 49 +++++- 3 files changed, 124 insertions(+), 83 deletions(-) diff --git a/notebooks/how_to/custom_test_result_descriptions.ipynb b/notebooks/how_to/custom_test_result_descriptions.ipynb index c455fd0ae..a846e8278 100644 --- a/notebooks/how_to/custom_test_result_descriptions.ipynb +++ b/notebooks/how_to/custom_test_result_descriptions.ipynb @@ -8,16 +8,16 @@ "\n", "When you run ValidMind tests, test descriptions are automatically generated with LLM using the test results, the test name, and the static test definitions provided in the test's docstring. While this metadata offers valuable high-level overviews of tests, insights produced by the LLM-based descriptions may not always align with your specific use cases or incorporate organizational policy requirements.\n", "\n", - "In this notebook, you'll learn how to take complete control over the context that drives test description generation. ValidMind provides three complementary parameters in `run_test` that give you comprehensive context management capabilities:\n", + "In this notebook, you'll learn how to take complete control over the context that drives test description generation. ValidMind provides a `context` parameter in `run_test` that accepts a dictionary with three complementary keys for comprehensive context management:\n", "\n", "\n", "- `instructions`: Controls how the final description is structured and presented. Use this to specify formatting requirements, target different audiences (executives vs. technical teams), or ensure consistent report styles across your organization.\n", "\n", - "- `knowledge`: Provides specific information about your organization's thresholds, business rules, and decision criteria. Use this to help the LLM understand what the results mean for your particular situation and how they should be interpreted.\n", + "- `additional_context`: Provides any background information you want the LLM to consider when analyzing results. This could include business priorities, acceptance thresholds, regulatory requirements, domain expertise, use case details, model purpose, stakeholder concerns, or any other contextual information that helps the LLM better understand and interpret your specific situation.\n", "\n", - "- `doc`: By default, this contains the technical mechanics of how the test works. However, for generic tests where the methodology isn't the focus, use this to describe what's actually being analyzed—the specific variables, features, or metrics being plotted and their business meaning rather than the statistical mechanics. 
You can also override ValidMind's built-in test documentation if you prefer different structure or language.\n", + "- `test_description`: By default, this contains the technical mechanics of how the test works. However, for generic tests where the methodology isn't the focus, use this to describe what's actually being analyzed—the specific variables, features, or metrics being plotted and their business meaning rather than the statistical mechanics. You can also override ValidMind's built-in test documentation if you prefer different structure or language.\n", "\n", - "Together, these parameters allow you to manage every aspect of the context that influences how the LLM interprets and presents your test results. Whether you need to align descriptions with regulatory requirements, target specific audiences, incorporate organizational policies, or ensure consistent reporting standards, these context management tools give you the flexibility to generate descriptions that perfectly match your needs while still leveraging the analytical power of AI-generated insights." + "Together, these context parameters allow you to manage every aspect of how the LLM interprets and presents your test results. Whether you need to align descriptions with regulatory requirements, target specific audiences, incorporate organizational policies, or ensure consistent reporting standards, this context management approach gives you the flexibility to generate descriptions that perfectly match your needs while still leveraging the analytical power of AI-generated insights." ] }, { @@ -38,15 +38,15 @@ " - [Structured format instructions](#toc4_2_)\n", " - [Template with LLM fill-ins](#toc4_3_)\n", " - [Mixed static and dynamic content](#toc4_4_)\n", - "- [Contextualizing results with knowledge](#toc5_)\n", - " - [Understanding the knowledge parameter](#toc5_1_)\n", - " - [Basic knowledge usage](#toc5_2_)\n", - " - [Combining instructions and knowledge](#toc5_3_)\n", - "- [Overriding test documentation with doc parameter](#toc6_)\n", + "- [Enriching results with additional context](#toc5_)\n", + " - [Understanding the additional context parameter](#toc5_1_)\n", + " - [Basic additional context usage](#toc5_2_)\n", + " - [Combining instructions and additional context](#toc5_3_)\n", + "- [Overriding test documentation with test description parameter](#toc6_)\n", " - [Structure of ValidMind built-in test docstrings](#toc6_1_)\n", - " - [Understanding the doc parameter](#toc6_2_)\n", - " - [Basic doc parameter usage](#toc6_3_)\n", - " - [Combining doc with instructions and knowledge](#toc6_4_)\n", + " - [Understanding the test description parameter](#toc6_2_)\n", + " - [Basic test description parameter usage](#toc6_3_)\n", + " - [Combining test description with instructions and additional context](#toc6_4_)\n", "- [Best practices for managing context](#toc7_)\n", ":::\n", "