diff --git a/notebooks/code_sharing/plots_and_stats_demo.ipynb b/notebooks/code_sharing/plots_and_stats_demo.ipynb
new file mode 100644
index 000000000..b41188ae0
--- /dev/null
+++ b/notebooks/code_sharing/plots_and_stats_demo.ipynb
@@ -0,0 +1,708 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "# Comprehensive Guide: ValidMind Plots and Statistics Tests\n",
+ "\n",
+ "This notebook demonstrates all the available tests from the `validmind.plots` and `validmind.stats` modules. These generalized tests provide powerful visualization and statistical analysis capabilities for any dataset.\n",
+ "\n",
+ "## What You'll Learn\n",
+ "\n",
+ "In this notebook, we'll explore:\n",
+ "\n",
+ "1. **Plotting Tests**: Visual analysis tools for data exploration\n",
+ " - CorrelationHeatmap\n",
+ " - HistogramPlot\n",
+ " - BoxPlot\n",
+ " - ViolinPlot\n",
+ "\n",
+ "2. **Statistical Tests**: Comprehensive statistical analysis tools\n",
+ " - DescriptiveStats\n",
+ " - CorrelationAnalysis\n",
+ " - NormalityTests\n",
+ " - OutlierDetection\n",
+ "\n",
+ "Each test is highly configurable and can be adapted to different datasets and use cases.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "## About ValidMind\n",
+ "\n",
+ "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models. You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "## Setting up\n",
+ "\n",
+ "### Install the ValidMind Library\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%pip install -q validmind\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "### Initialize the ValidMind Library\n",
+ "\n",
+ "For this demonstration, we'll initialize ValidMind in demo mode.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Load your model identifier credentials from an `.env` file\n",
+ "\n",
+ "%load_ext dotenv\n",
+ "%dotenv .env\n",
+ "\n",
+ "# Or replace with your code snippet\n",
+ "\n",
+ "import validmind as vm\n",
+ "\n",
+ "# Note: You need valid API credentials for this to work\n",
+ "# If you don't have credentials, use the standalone script: test_outlier_detection_standalone.py\n",
+ "\n",
+ "vm.init(\n",
+ " api_host=\"...\",\n",
+ " api_key=\"...\",\n",
+ " api_secret=\"...\",\n",
+ " model=\"...\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "## Import and Prepare Sample Dataset\n",
+ "\n",
+ "We'll use the Bank Customer Churn dataset as our example data for demonstrating all the tests.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from validmind.datasets.classification import customer_churn\n",
+ "\n",
+ "print(\n",
+ " f\"Loaded demo dataset with: \\n\\n\\t• Target column: '{customer_churn.target_column}' \\n\\t• Class labels: {customer_churn.class_labels}\"\n",
+ ")\n",
+ "\n",
+ "# Load and preprocess the data\n",
+ "raw_df = customer_churn.load_data()\n",
+ "train_df, validation_df, test_df = customer_churn.preprocess(raw_df)\n",
+ "\n",
+ "print(\"\\nDataset shapes:\")\n",
+ "print(f\"• Training: {train_df.shape}\")\n",
+ "print(f\"• Validation: {validation_df.shape}\")\n",
+ "print(f\"• Test: {test_df.shape}\")\n",
+ "\n",
+ "raw_df.head()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "### Initialize ValidMind Datasets\n",
+ "\n",
+ "Initialize ValidMind dataset objects for our analysis:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Initialize datasets for ValidMind\n",
+ "vm_raw_dataset = vm.init_dataset(\n",
+ " dataset=raw_df,\n",
+ " input_id=\"raw_dataset\",\n",
+ " target_column=customer_churn.target_column,\n",
+ " class_labels=customer_churn.class_labels,\n",
+ ")\n",
+ "\n",
+ "vm_train_ds = vm.init_dataset(\n",
+ " dataset=train_df,\n",
+ " input_id=\"train_dataset\",\n",
+ " target_column=customer_churn.target_column,\n",
+ ")\n",
+ "\n",
+ "print(\"✅ ValidMind datasets initialized successfully!\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "### Explore Dataset Structure\n",
+ "\n",
+ "Let's examine our dataset to understand what columns are available for analysis:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(\"📊 Dataset Information:\")\n",
+ "print(f\"\\nAll columns ({len(vm_train_ds.df.columns)}):\")\n",
+ "print(list(vm_train_ds.df.columns))\n",
+ "\n",
+ "print(f\"\\nNumerical columns ({len(vm_train_ds.feature_columns_numeric)}):\")\n",
+ "print(vm_train_ds.feature_columns_numeric)\n",
+ "\n",
+ "print(f\"\\nCategorical columns ({len(vm_train_ds.feature_columns_categorical) if hasattr(vm_train_ds, 'feature_columns_categorical') else 0}):\")\n",
+ "print(vm_train_ds.feature_columns_categorical if hasattr(vm_train_ds, 'feature_columns_categorical') else \"None detected\")\n",
+ "\n",
+ "print(f\"\\nTarget column: {vm_train_ds.target_column}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "# Part 1: Plotting Tests\n",
+ "\n",
+ "The ValidMind plotting tests provide powerful visualization capabilities for data exploration and analysis. All plots are interactive and built with Plotly.\n",
+ "\n",
+ "## 1. Correlation Heatmap\n",
+ "\n",
+ "Visualizes correlations between numerical features using a heatmap. Useful for identifying multicollinearity and feature relationships.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Basic correlation heatmap\n",
+ "vm.tests.run_test(\n",
+ " \"validmind.plots.CorrelationHeatmap\",\n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\n",
+ " \"method\": \"pearson\",\n",
+ " \"show_values\": True,\n",
+ " \"colorscale\": \"RdBu\",\n",
+ " \"mask_upper\": False,\n",
+ " \"threshold\": None,\n",
+ " \"width\": 800,\n",
+ " \"height\": 600,\n",
+ " \"title\": \"Feature Correlation Heatmap\"\n",
+ " }\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Advanced correlation heatmap with custom settings\n",
+ "vm.tests.run_test(\n",
+ " \"validmind.plots.CorrelationHeatmap\",\n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\n",
+ " \"method\": \"spearman\", # Different correlation method\n",
+ " \"show_values\": True,\n",
+ " \"colorscale\": \"Viridis\",\n",
+ " \"mask_upper\": True, # Mask upper triangle\n",
+ " \"width\": 900,\n",
+ " \"height\": 700,\n",
+ "title\": \"Spearman Correlation (Selected Columns)\",\n",
+ " \"columns\": [\"CreditScore\", \"Age\", \"Balance\", \"EstimatedSalary\"] # Specific columns\n",
+ " }\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "## 2. Histogram Plot\n",
+ "\n",
+ "Creates histogram distributions for numerical features with optional KDE overlay. Essential for understanding data distributions.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Basic histogram with KDE\n",
+ "vm.tests.run_test(\n",
+ " \"validmind.plots.HistogramPlot\",\n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\n",
+ " \"columns\": [\"CreditScore\", \"Balance\", \"EstimatedSalary\", \"Age\"],\n",
+ " \"bins\": 30,\n",
+ " \"color\": \"steelblue\",\n",
+ " \"opacity\": 0.7,\n",
+ " \"show_kde\": True,\n",
+ " \"normalize\": False,\n",
+ " \"log_scale\": False,\n",
+ " \"width\": 1200,\n",
+ " \"height\": 800,\n",
+ " \"n_cols\": 2,\n",
+ " \"vertical_spacing\": 0.15,\n",
+ " \"horizontal_spacing\": 0.15,\n",
+ " \"title_prefix\": \"Distribution of\"\n",
+ " }\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "## 3. Box Plot\n",
+ "\n",
+ "Displays box plots for numerical features, optionally grouped by a categorical variable. Excellent for outlier detection and comparing distributions.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Box plots grouped by target variable\n",
+ "vm.tests.run_test(\n",
+ " \"validmind.plots.BoxPlot\", \n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\n",
+ " \"columns\": [\"CreditScore\", \"Balance\", \"Age\"],\n",
+ " \"group_by\": \"Exited\", # Group by churn status\n",
+ " \"colors\": [\"lightblue\", \"salmon\"],\n",
+ " \"show_outliers\": True,\n",
+ " \"width\": 1200,\n",
+ " \"height\": 600\n",
+ " }\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "## 4. Violin Plot\n",
+ "\n",
+ "Creates violin plots that combine box plots with kernel density estimation. Shows both summary statistics and distribution shape.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Violin plots grouped by target variable\n",
+ "vm.tests.run_test(\n",
+ " \"validmind.plots.ViolinPlot\",\n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\n",
+ " \"columns\": [\"Age\", \"Balance\"], # Focus on key variables\n",
+ " \"group_by\": \"Exited\",\n",
+ " \"width\": 800,\n",
+ " \"height\": 600\n",
+ " }\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "# Part 2: Statistical Tests\n",
+ "\n",
+ "The ValidMind statistical tests provide comprehensive statistical analysis capabilities for understanding data characteristics and quality.\n",
+ "\n",
+ "## 1. Descriptive Statistics\n",
+ "\n",
+ "Provides comprehensive descriptive statistics including basic statistics, distribution measures, confidence intervals, and normality tests.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Advanced descriptive statistics with all measures\n",
+ "vm.tests.run_test(\n",
+ " \"validmind.stats.DescriptiveStats\",\n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\n",
+ " \"include_advanced\": True, # Include skewness, kurtosis, normality tests, etc.\n",
+ " \"confidence_level\": 0.99, # 99% confidence intervals\n",
+ " \"columns\": [\"CreditScore\", \"Balance\", \"EstimatedSalary\", \"Age\"] # Specific columns\n",
+ " }\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "## 2. Correlation Analysis\n",
+ "\n",
+ "Performs detailed correlation analysis with statistical significance testing and identifies highly correlated feature pairs.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Correlation analysis with significance testing\n",
+ "result = vm.tests.run_test(\n",
+ " \"validmind.stats.CorrelationAnalysis\",\n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\n",
+ " \"method\": \"pearson\", # or \"spearman\", \"kendall\"\n",
+ " \"significance_level\": 0.05,\n",
+ " \"min_correlation\": 0.1 # Minimum correlation threshold\n",
+ " }\n",
+ ")\n",
+ "result.log()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "## 3. Normality Tests\n",
+ "\n",
+ "Performs various normality tests to assess whether features follow a normal distribution.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Comprehensive normality testing\n",
+ "vm.tests.run_test(\n",
+ " \"validmind.stats.NormalityTests\",\n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\n",
+ " \"tests\": [\"shapiro\", \"anderson\", \"kstest\"], # Multiple tests\n",
+ " \"alpha\": 0.05,\n",
+ " \"columns\": [\"CreditScore\", \"Balance\", \"Age\"] # Focus on key features\n",
+ " }\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "## 4. Outlier Detection\n",
+ "\n",
+ "Identifies outliers using various statistical methods including IQR, Z-score, and Isolation Forest.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Comprehensive outlier detection with multiple methods\n",
+ "vm.tests.run_test(\n",
+ " \"validmind.stats.OutlierDetection\",\n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\n",
+ " \"methods\": [\"iqr\", \"zscore\", \"isolation_forest\"],\n",
+ " \"iqr_threshold\": 1.5,\n",
+ " \"zscore_threshold\": 3.0,\n",
+ " \"contamination\": 0.1,\n",
+ " \"columns\": [\"CreditScore\", \"Balance\", \"EstimatedSalary\"]\n",
+ " }\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "# Part 3: Complete EDA Workflow Example\n",
+ "\n",
+ "Let's demonstrate a complete exploratory data analysis workflow using all the tests together:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Example: Complete EDA workflow using all tests\n",
+ "print(\"🔍 Complete Exploratory Data Analysis Workflow\")\n",
+ "print(\"=\" * 50)\n",
+ "\n",
+ "# 1. Start with descriptive statistics\n",
+ "print(\"\\n1. Descriptive Statistics:\")\n",
+ "desc_stats = vm.tests.run_test(\n",
+ " \"validmind.stats.DescriptiveStats\",\n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\"include_advanced\": True}\n",
+ ")\n",
+ "\n",
+ "print(\"\\n2. Distribution Analysis:\")\n",
+ "# 2. Visualize distributions\n",
+ "hist_plot = vm.tests.run_test(\n",
+ " \"validmind.plots.HistogramPlot\",\n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\"show_kde\": True, \"n_cols\": 3}\n",
+ ")\n",
+ "\n",
+ "print(\"\\n3. Correlation Analysis:\")\n",
+ "# 3. Check correlations\n",
+ "corr_heatmap = vm.tests.run_test(\n",
+ " \"validmind.plots.CorrelationHeatmap\",\n",
+ " inputs={\"dataset\": vm_train_ds}\n",
+ ")\n",
+ "\n",
+ "print(\"\\n4. Outlier Detection:\")\n",
+ "# 4. Detect outliers\n",
+ "outliers = vm.tests.run_test(\n",
+ " \"validmind.stats.OutlierDetection\",\n",
+ " inputs={\"dataset\": vm_train_ds},\n",
+ " params={\"methods\": [\"iqr\", \"zscore\"]}\n",
+ ")\n",
+ "\n",
+ "print(\"\\n✅ EDA Complete! Check the visualizations and tables above for insights.\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "# Conclusion\n",
+ "\n",
+ "This notebook demonstrated all the plotting and statistical tests available in ValidMind:\n",
+ "\n",
+ "## Plotting Tests Covered:\n",
+ "✅ **CorrelationHeatmap** - Interactive correlation matrices \n",
+ "✅ **HistogramPlot** - Distribution analysis with KDE \n",
+ "✅ **BoxPlot** - Outlier detection and group comparisons \n",
+ "✅ **ViolinPlot** - Distribution shape analysis \n",
+ "\n",
+ "## Statistical Tests Covered:\n",
+ "✅ **DescriptiveStats** - Comprehensive statistical profiling \n",
+ "✅ **CorrelationAnalysis** - Formal correlation testing \n",
+ "✅ **NormalityTests** - Distribution assumption checking \n",
+ "✅ **OutlierDetection** - Multi-method outlier identification \n",
+ "\n",
+ "## Key Benefits:\n",
+ "- **Highly Customizable**: All tests offer extensive parameter options\n",
+ "- **Interactive Visualizations**: Plotly-based plots with zoom, pan, hover\n",
+ "- **Statistical Rigor**: Formal testing with significance levels\n",
+ "- **Flexible Input**: Works with any ValidMind dataset\n",
+ "- **Comprehensive Output**: Tables, plots, and statistical summaries\n",
+ "\n",
+ "## Best Practices:\n",
+ "\n",
+ "### When to Use Each Test:\n",
+ "\n",
+ "**Plotting Tests:**\n",
+ "- **CorrelationHeatmap**: Initial data exploration, multicollinearity detection\n",
+ "- **HistogramPlot**: Understanding feature distributions, identifying skewness\n",
+ "- **BoxPlot**: Outlier detection, comparing groups\n",
+ "- **ViolinPlot**: Detailed distribution analysis, especially for grouped data\n",
+ "\n",
+ "**Statistical Tests:**\n",
+ "- **DescriptiveStats**: Comprehensive data profiling, baseline statistics\n",
+ "- **CorrelationAnalysis**: Formal correlation testing with significance\n",
+ "- **NormalityTests**: Model assumption checking\n",
+ "- **OutlierDetection**: Data quality assessment, preprocessing decisions\n",
+ "\n",
+ "## Next Steps:\n",
+ "- Integrate these tests into your model documentation templates\n",
+ "- Customize parameters based on your specific data characteristics\n",
+ "- Use results to inform preprocessing and modeling decisions\n",
+ "- Combine with ValidMind's model validation tests for complete analysis\n",
+ "\n",
+ "These tests provide a solid foundation for exploratory data analysis, data quality assessment, and statistical validation in any data science workflow.\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "ValidMind Library",
+ "language": "python",
+ "name": "validmind"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
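The notebook's `OutlierDetection` call relies on the standard IQR and Z-score rules. As a rough standalone sketch of the underlying math (plain pandas/numpy on synthetic data — this is not the ValidMind implementation, and the planted outlier values are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic data: 995 well-behaved points plus 5 planted extremes
rng = np.random.default_rng(42)
values = pd.Series(
    np.concatenate([rng.normal(50, 5, 995), [120.0, 130.0, -40.0, 140.0, 150.0]])
)

# IQR rule: flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3.0]

print(f"IQR flagged: {len(iqr_outliers)}, Z-score flagged: {len(z_outliers)}")
```

The `iqr_threshold` and `zscore_threshold` parameters used in the notebook correspond to the `1.5` and `3.0` constants above; the IQR rule is typically more aggressive than the Z-score rule because the quartiles are robust to the outliers themselves.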
diff --git a/validmind/tests/__init__.py b/validmind/tests/__init__.py
index 2de78d703..5112a527e 100644
--- a/validmind/tests/__init__.py
+++ b/validmind/tests/__init__.py
@@ -43,6 +43,8 @@ def register_test_provider(namespace: str, test_provider: TestProvider) -> None:
"data_validation",
"model_validation",
"prompt_validation",
+ "plots",
+ "stats",
"list_tests",
"load_test",
"describe_test",
diff --git a/validmind/tests/plots/BoxPlot.py b/validmind/tests/plots/BoxPlot.py
new file mode 100644
index 000000000..7c2861ef4
--- /dev/null
+++ b/validmind/tests/plots/BoxPlot.py
@@ -0,0 +1,260 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List, Optional
+
+import plotly.graph_objects as go
+from plotly.subplots import make_subplots
+
+from validmind import tags, tasks
+from validmind.errors import SkipTestError
+from validmind.vm_models import VMDataset
+
+
+def _validate_inputs(
+ dataset: VMDataset, columns: Optional[List[str]], group_by: Optional[str]
+):
+ """Validate inputs and return validated columns."""
+ if columns is None:
+ columns = dataset.feature_columns_numeric
+ else:
+ available_columns = set(dataset.feature_columns_numeric)
+ columns = [col for col in columns if col in available_columns]
+
+ if not columns:
+ raise SkipTestError("No numerical columns found for box plotting")
+
+ if group_by is not None:
+ if group_by not in dataset.df.columns:
+ raise SkipTestError(f"Group column '{group_by}' not found in dataset")
+ if group_by in columns:
+ columns.remove(group_by)
+
+ return columns
+
+
+def _create_grouped_boxplot(
+ dataset, columns, group_by, colors, show_outliers, title_prefix, width, height
+):
+ """Create grouped box plots."""
+ fig = go.Figure()
+ groups = dataset.df[group_by].dropna().unique()
+
+ for col_idx, column in enumerate(columns):
+ for group_idx, group_value in enumerate(groups):
+ data_subset = dataset.df[dataset.df[group_by] == group_value][
+ column
+ ].dropna()
+
+ if len(data_subset) > 0:
+ color = colors[group_idx % len(colors)]
+ fig.add_trace(
+ go.Box(
+ y=data_subset,
+ name=f"{group_value}",
+ marker_color=color,
+ boxpoints="outliers" if show_outliers else False,
+ jitter=0.3,
+ pointpos=-1.8,
+ legendgroup=f"{group_value}",
+ showlegend=(col_idx == 0),
+ offsetgroup=group_idx,
+ x=[column] * len(data_subset),
+ )
+ )
+
+ fig.update_layout(
+ title=f"{title_prefix} Features by {group_by}",
+ xaxis_title="Features",
+ yaxis_title="Values",
+ boxmode="group",
+ width=width,
+ height=height,
+ template="plotly_white",
+ )
+ return fig
+
+
+def _create_single_boxplot(
+ dataset, column, colors, show_outliers, title_prefix, width, height
+):
+ """Create single column box plot."""
+ data = dataset.df[column].dropna()
+ if len(data) == 0:
+ raise SkipTestError(f"No data available for column {column}")
+
+ fig = go.Figure()
+ fig.add_trace(
+ go.Box(
+ y=data,
+ name=column,
+ marker_color=colors[0],
+ boxpoints="outliers" if show_outliers else False,
+ jitter=0.3,
+ pointpos=-1.8,
+ )
+ )
+
+ fig.update_layout(
+ title=f"{title_prefix} {column}",
+ yaxis_title=column,
+ width=width,
+ height=height,
+ template="plotly_white",
+ showlegend=False,
+ )
+ return fig
+
+
+def _create_multiple_boxplots(
+ dataset, columns, colors, show_outliers, title_prefix, width, height
+):
+ """Create multiple column box plots in subplot layout."""
+ n_cols = min(3, len(columns))
+ n_rows = (len(columns) + n_cols - 1) // n_cols
+
+ subplot_titles = [f"{title_prefix} {col}" for col in columns]
+ fig = make_subplots(
+ rows=n_rows,
+ cols=n_cols,
+ subplot_titles=subplot_titles,
+ vertical_spacing=0.1,
+ horizontal_spacing=0.1,
+ )
+
+ for idx, column in enumerate(columns):
+ row = (idx // n_cols) + 1
+ col = (idx % n_cols) + 1
+ data = dataset.df[column].dropna()
+
+ if len(data) > 0:
+ color = colors[idx % len(colors)]
+ fig.add_trace(
+ go.Box(
+ y=data,
+ name=column,
+ marker_color=color,
+ boxpoints="outliers" if show_outliers else False,
+ jitter=0.3,
+ pointpos=-1.8,
+ showlegend=False,
+ ),
+ row=row,
+ col=col,
+ )
+ fig.update_yaxes(title_text=column, row=row, col=col)
+ else:
+ fig.add_annotation(
+ text=f"No data available<br>for {column}",
+ x=0.5,
+ y=0.5,
+ xref=f"x{idx+1} domain" if idx > 0 else "x domain",
+ yref=f"y{idx+1} domain" if idx > 0 else "y domain",
+ showarrow=False,
+ row=row,
+ col=col,
+ )
+
+ fig.update_layout(
+ title="Dataset Feature Distributions",
+ width=width,
+ height=height,
+ template="plotly_white",
+ showlegend=False,
+ )
+ return fig
+
+
+@tags("tabular_data", "visualization", "data_quality")
+@tasks("classification", "regression", "clustering")
+def BoxPlot(
+ dataset: VMDataset,
+ columns: Optional[List[str]] = None,
+ group_by: Optional[str] = None,
+ width: int = 1200,
+ height: int = 600,
+ colors: Optional[List[str]] = None,
+ show_outliers: bool = True,
+ title_prefix: str = "Box Plot of",
+) -> go.Figure:
+ """
+ Generates customizable box plots for numerical features in a dataset with optional grouping using Plotly.
+
+ ### Purpose
+
+ This test provides a flexible way to visualize the distribution of numerical features
+ through interactive box plots, with optional grouping by categorical variables. Box plots are
+ effective for identifying outliers, comparing distributions across groups, and
+ understanding the spread and central tendency of the data.
+
+ ### Test Mechanism
+
+ The test creates interactive box plots for specified numerical columns (or all numerical columns
+ if none specified). It supports various customization options including:
+ - Grouping by categorical variables
+ - Customizable colors and styling
+ - Outlier display options
+ - Interactive hover information
+ - Zoom and pan capabilities
+
+ ### Signs of High Risk
+
+ - Presence of many outliers indicating data quality issues
+ - Highly skewed distributions
+ - Large differences in variance across groups
+ - Unexpected patterns in grouped data
+
+ ### Strengths
+
+ - Clear visualization of distribution statistics (median, quartiles, outliers)
+ - Interactive Plotly plots with hover information and zoom capabilities
+ - Effective for comparing distributions across groups
+ - Handles missing values appropriately
+ - Highly customizable appearance
+
+ ### Limitations
+
+ - Limited to numerical features only
+ - Can obscure multimodal distribution shapes by reducing data to quartile summaries
+ - Visual interpretation may be subjective
+ - Less effective with very large datasets
+ """
+ # Validate inputs
+ columns = _validate_inputs(dataset, columns, group_by)
+
+ # Set default colors
+ if colors is None:
+ colors = [
+ "steelblue",
+ "orange",
+ "green",
+ "red",
+ "purple",
+ "brown",
+ "pink",
+ "gray",
+ "olive",
+ "cyan",
+ ]
+
+ # Create appropriate plot type
+ if group_by is not None:
+ return _create_grouped_boxplot(
+ dataset,
+ columns,
+ group_by,
+ colors,
+ show_outliers,
+ title_prefix,
+ width,
+ height,
+ )
+ elif len(columns) == 1:
+ return _create_single_boxplot(
+ dataset, columns[0], colors, show_outliers, title_prefix, width, height
+ )
+ else:
+ return _create_multiple_boxplots(
+ dataset, columns, colors, show_outliers, title_prefix, width, height
+ )
diff --git a/validmind/tests/plots/CorrelationHeatmap.py b/validmind/tests/plots/CorrelationHeatmap.py
new file mode 100644
index 000000000..c37bb894e
--- /dev/null
+++ b/validmind/tests/plots/CorrelationHeatmap.py
@@ -0,0 +1,235 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List, Optional
+
+import numpy as np
+import plotly.graph_objects as go
+
+from validmind import tags, tasks
+from validmind.errors import SkipTestError
+from validmind.vm_models import VMDataset
+
+
+def _validate_and_prepare_data(
+ dataset: VMDataset, columns: Optional[List[str]], method: str
+):
+ """Validate inputs and prepare correlation data."""
+ if columns is None:
+ columns = dataset.feature_columns_numeric
+ else:
+ available_columns = set(dataset.feature_columns_numeric)
+ columns = [col for col in columns if col in available_columns]
+
+ if not columns:
+ raise SkipTestError("No numerical columns found for correlation analysis")
+
+ if len(columns) < 2:
+ raise SkipTestError(
+ "At least 2 numerical columns required for correlation analysis"
+ )
+
+ # Get data and remove constant columns
+ data = dataset.df[columns]
+ data = data.loc[:, data.var() != 0]
+
+ if data.shape[1] < 2:
+ raise SkipTestError(
+ "Insufficient non-constant columns for correlation analysis"
+ )
+
+ return data.corr(method=method)
+
+
+def _apply_filters(corr_matrix, threshold: Optional[float], mask_upper: bool):
+ """Apply threshold and masking filters to correlation matrix."""
+ if threshold is not None:
+ mask = np.abs(corr_matrix) < threshold
+ corr_matrix = corr_matrix.mask(mask)
+
+ if mask_upper:
+ mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
+ corr_matrix = corr_matrix.mask(mask)
+
+ return corr_matrix
+
+
+def _create_annotation_text(z_values, y_labels, x_labels, show_values: bool):
+ """Create text annotations for heatmap cells."""
+ if not show_values:
+ return None
+
+ text = []
+ for i in range(len(y_labels)):
+ text_row = []
+ for j in range(len(x_labels)):
+ value = z_values[i][j]
+ if np.isnan(value):
+ text_row.append("")
+ else:
+ text_row.append(f"{value:.3f}")
+ text.append(text_row)
+ return text
+
+
+def _calculate_adaptive_font_size(n_features: int) -> int:
+ """Calculate adaptive font size based on number of features."""
+ if n_features <= 10:
+ return 12
+ elif n_features <= 20:
+ return 10
+ elif n_features <= 30:
+ return 8
+ else:
+ return 6
+
+
+def _calculate_stats_and_update_layout(
+ fig, corr_matrix, method: str, title: str, width: int, height: int
+):
+ """Calculate statistics and update figure layout."""
+ n_features = corr_matrix.shape[0]
+ upper_triangle = corr_matrix.values[np.triu_indices_from(corr_matrix.values, k=1)]
+ upper_triangle = upper_triangle[~np.isnan(upper_triangle)]
+
+ if len(upper_triangle) > 0:
+ mean_corr = np.abs(upper_triangle).mean()
+ max_corr = np.abs(upper_triangle).max()
+ stats_text = f"Features: {n_features}<br>Mean |r|: {mean_corr:.3f}<br>Max |r|: {max_corr:.3f}"
+ else:
+ stats_text = f"Features: {n_features}"
+
+ fig.update_layout(
+ title={
+ "text": f"{title} ({method.capitalize()} Correlation)",
+ "x": 0.5,
+ "xanchor": "center",
+ },
+ width=width,
+ height=height,
+ template="plotly_white",
+ xaxis=dict(tickangle=45, side="bottom"),
+ yaxis=dict(tickmode="linear", autorange="reversed"),
+ annotations=[
+ dict(
+ text=stats_text,
+ x=0.02,
+ y=0.98,
+ xref="paper",
+ yref="paper",
+ showarrow=False,
+ align="left",
+ bgcolor="rgba(255,255,255,0.8)",
+ bordercolor="black",
+ borderwidth=1,
+ )
+ ],
+ )
+
+
+@tags("tabular_data", "visualization", "correlation")
+@tasks("classification", "regression", "clustering")
+def CorrelationHeatmap(
+ dataset: VMDataset,
+ columns: Optional[List[str]] = None,
+ method: str = "pearson",
+ show_values: bool = True,
+ colorscale: str = "RdBu",
+ width: int = 800,
+ height: int = 600,
+ mask_upper: bool = False,
+ threshold: Optional[float] = None,
+ title: str = "Correlation Heatmap",
+) -> go.Figure:
+ """
+ Generates customizable correlation heatmap plots for numerical features in a dataset using Plotly.
+
+ ### Purpose
+
+ This test provides a flexible way to visualize correlations between numerical features
+ in a dataset using interactive Plotly heatmaps. It supports different correlation methods
+ and extensive customization options for the heatmap appearance, making it suitable for
+ exploring feature relationships in data analysis.
+
+ ### Test Mechanism
+
+ The test computes correlation coefficients between specified numerical columns
+ (or all numerical columns if none specified) using the specified method.
+ It then creates an interactive heatmap visualization with customizable appearance options including:
+ - Different correlation methods (pearson, spearman, kendall)
+ - Color schemes and annotations
+ - Masking options for upper triangle
+ - Threshold filtering for significant correlations
+ - Interactive hover information
+
+ ### Signs of High Risk
+
+ - Very high correlations (>0.9) between features indicating multicollinearity
+ - Unexpected correlation patterns that contradict domain knowledge
+ - Features with no correlation to any other variables
+ - Strong correlations with the target variable that might indicate data leakage
+
+ ### Strengths
+
+ - Supports multiple correlation methods
+ - Interactive Plotly plots with hover information and zoom capabilities
+ - Highly customizable visualization options
+ - Can handle missing values appropriately
+ - Provides clear visual representation of feature relationships
+ - Optional thresholding to focus on significant correlations
+
+ ### Limitations
+
+ - Limited to numerical features only
+ - Cannot capture non-linear relationships effectively
+ - May be difficult to interpret with many features
+ - Correlation does not imply causation
+ """
+ # Validate inputs and compute correlation
+ corr_matrix = _validate_and_prepare_data(dataset, columns, method)
+
+ # Apply filters
+ corr_matrix = _apply_filters(corr_matrix, threshold, mask_upper)
+
+ # Prepare heatmap data
+ z_values = corr_matrix.values
+ x_labels = corr_matrix.columns.tolist()
+ y_labels = corr_matrix.index.tolist()
+ text = _create_annotation_text(z_values, y_labels, x_labels, show_values)
+
+ # Calculate adaptive font size
+ n_features = len(x_labels)
+ font_size = _calculate_adaptive_font_size(n_features)
+
+ # Create heatmap
+ heatmap_kwargs = {
+ "z": z_values,
+ "x": x_labels,
+ "y": y_labels,
+ "colorscale": colorscale,
+ "zmin": -1,
+ "zmax": 1,
+ "colorbar": dict(title=f"{method.capitalize()} Correlation"),
+ "hoverongaps": False,
+ "hovertemplate": "%{y} vs %{x}<br>"
+ + f"{method.capitalize()} Correlation: %{{z:.3f}}<br>"
+ + "<extra></extra>",
+ }
+
+ # Add text annotations if requested
+ if show_values and text is not None:
+ heatmap_kwargs.update(
+ {
+ "text": text,
+ "texttemplate": "%{text}",
+ "textfont": {"size": font_size, "color": "black"},
+ }
+ )
+
+ fig = go.Figure(data=go.Heatmap(**heatmap_kwargs))
+
+ # Update layout with stats
+ _calculate_stats_and_update_layout(fig, corr_matrix, method, title, width, height)
+
+ return fig
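The summary helper above pulls the upper triangle of the correlation matrix with `np.triu_indices_from(..., k=1)` so each feature pair is counted exactly once before computing mean and max |r|. A minimal standalone sketch of that computation, using a toy 3x3 matrix (not data from this PR):

```python
import numpy as np
import pandas as pd

# Toy correlation matrix for three hypothetical features
corr = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.5],
     [-0.2, 0.5, 1.0]],
    index=list("abc"),
    columns=list("abc"),
)

# k=1 keeps only entries strictly above the diagonal, so each pair appears once
upper = corr.values[np.triu_indices_from(corr.values, k=1)]
upper = upper[~np.isnan(upper)]

mean_abs = np.abs(upper).mean()  # (0.8 + 0.2 + 0.5) / 3
max_abs = np.abs(upper).max()
```

For n features this yields n*(n-1)/2 values, which is why the annotation reports pair-level rather than matrix-level statistics.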
diff --git a/validmind/tests/plots/HistogramPlot.py b/validmind/tests/plots/HistogramPlot.py
new file mode 100644
index 000000000..b5fbbaf35
--- /dev/null
+++ b/validmind/tests/plots/HistogramPlot.py
@@ -0,0 +1,233 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List, Optional, Union
+
+import numpy as np
+import plotly.graph_objects as go
+from plotly.subplots import make_subplots
+from scipy import stats
+
+from validmind import tags, tasks
+from validmind.errors import SkipTestError
+from validmind.vm_models import VMDataset
+
+
+def _validate_columns(dataset: VMDataset, columns: Optional[List[str]]):
+ """Validate and return numerical columns."""
+ if columns is None:
+ columns = dataset.feature_columns_numeric
+ else:
+ available_columns = set(dataset.feature_columns_numeric)
+ columns = [col for col in columns if col in available_columns]
+
+ if not columns:
+ raise SkipTestError("No numerical columns found for histogram plotting")
+
+ return columns
+
+
+def _process_column_data(data, log_scale: bool, column: str):
+ """Process column data and return plot data and xlabel."""
+ plot_data = data
+ xlabel = column
+ if log_scale and (data > 0).all():
+ plot_data = np.log10(data)
+ xlabel = f"log10({column})"
+ return plot_data, xlabel
+
+
+def _add_histogram_trace(
+ fig, plot_data, bins, color, opacity, normalize, column, row, col
+):
+ """Add histogram trace to figure."""
+ histnorm = "probability density" if normalize else None
+
+ fig.add_trace(
+ go.Histogram(
+ x=plot_data,
+ nbinsx=bins if isinstance(bins, int) else None,
+ name=f"Histogram - {column}",
+ marker_color=color,
+ opacity=opacity,
+ histnorm=histnorm,
+ showlegend=False,
+ ),
+ row=row,
+ col=col,
+ )
+
+
+def _add_kde_trace(fig, plot_data, bins, normalize, column, row, col):
+ """Add KDE trace to figure if possible."""
+ try:
+ kde = stats.gaussian_kde(plot_data)
+ x_range = np.linspace(plot_data.min(), plot_data.max(), 100)
+ kde_values = kde(x_range)
+
+ if not normalize:
+ hist_max = (
+ len(plot_data) / bins if isinstance(bins, int) else len(plot_data) / 30
+ )
+ kde_values = kde_values * hist_max / kde_values.max()
+
+ fig.add_trace(
+ go.Scatter(
+ x=x_range,
+ y=kde_values,
+ mode="lines",
+ name=f"KDE - {column}",
+ line=dict(color="red", width=2),
+ showlegend=False,
+ ),
+ row=row,
+ col=col,
+ )
+ except Exception:
+ pass
+
+
+def _add_stats_annotation(fig, data, idx, row, col):
+ """Add statistics annotation to subplot."""
+ stats_text = f"Mean: {data.mean():.3f}<br>Std: {data.std():.3f}<br>N: {len(data)}"
+ fig.add_annotation(
+ text=stats_text,
+ x=0.02,
+ y=0.98,
+ xref=f"x{idx+1} domain" if idx > 0 else "x domain",
+ yref=f"y{idx+1} domain" if idx > 0 else "y domain",
+ showarrow=False,
+ align="left",
+ bgcolor="rgba(255,255,255,0.8)",
+ bordercolor="black",
+ borderwidth=1,
+ row=row,
+ col=col,
+ )
+
+
+@tags("tabular_data", "visualization", "data_quality")
+@tasks("classification", "regression", "clustering")
+def HistogramPlot(
+ dataset: VMDataset,
+ columns: Optional[List[str]] = None,
+ bins: Union[int, str, List] = 30,
+ color: str = "steelblue",
+ opacity: float = 0.7,
+ show_kde: bool = True,
+ normalize: bool = False,
+ log_scale: bool = False,
+ title_prefix: str = "Histogram of",
+ width: int = 1200,
+ height: int = 800,
+ n_cols: int = 2,
+ vertical_spacing: float = 0.15,
+ horizontal_spacing: float = 0.1,
+) -> go.Figure:
+ """
+ Generates customizable histogram plots for numerical features in a dataset using Plotly.
+
+ ### Purpose
+
+ This test provides a flexible way to visualize the distribution of numerical features in a dataset.
+ It allows for extensive customization of the histogram appearance and behavior through parameters,
+ making it suitable for various exploratory data analysis tasks.
+
+ ### Test Mechanism
+
+ The test creates histogram plots for specified numerical columns (or all numerical columns if none specified).
+ It supports various customization options including:
+ - Number of bins or bin edges
+ - Color and opacity
+ - Kernel density estimation overlay
+ - Logarithmic scaling
+ - Normalization options
+ - Configurable subplot layout (columns and spacing)
+
+ ### Signs of High Risk
+
+ - Highly skewed distributions that may indicate data quality issues
+ - Unexpected bimodal or multimodal distributions
+ - Presence of extreme outliers
+ - Empty or sparse distributions
+
+ ### Strengths
+
+ - Highly customizable visualization options
+ - Interactive Plotly plots with zoom, pan, and hover capabilities
+ - Supports both single and multiple column analysis
+ - Provides insights into data distribution patterns
+ - Can handle different data types and scales
+ - Configurable subplot layout for better visualization
+
+ ### Limitations
+
+ - Limited to numerical features only
+ - Visual interpretation may be subjective
+ - May not be suitable for high-dimensional datasets
+ - Performance may degrade with very large datasets
+ """
+ # Validate inputs
+ columns = _validate_columns(dataset, columns)
+
+ # Calculate subplot layout
+ n_cols = min(n_cols, len(columns))
+ n_rows = (len(columns) + n_cols - 1) // n_cols
+
+ # Create subplots
+ subplot_titles = [f"{title_prefix} {col}" for col in columns]
+ fig = make_subplots(
+ rows=n_rows,
+ cols=n_cols,
+ subplot_titles=subplot_titles,
+ vertical_spacing=vertical_spacing,
+ horizontal_spacing=horizontal_spacing,
+ )
+
+ for idx, column in enumerate(columns):
+ row = (idx // n_cols) + 1
+ col = (idx % n_cols) + 1
+ data = dataset.df[column].dropna()
+
+ if len(data) == 0:
+ fig.add_annotation(
+ text=f"No data available<br>for {column}",
+ x=0.5,
+ y=0.5,
+ xref=f"x{idx+1}" if idx > 0 else "x",
+ yref=f"y{idx+1}" if idx > 0 else "y",
+ showarrow=False,
+ row=row,
+ col=col,
+ )
+ continue
+
+ # Process data
+ plot_data, xlabel = _process_column_data(data, log_scale, column)
+
+ # Add histogram
+ _add_histogram_trace(
+ fig, plot_data, bins, color, opacity, normalize, column, row, col
+ )
+
+ # Add KDE if requested
+ if show_kde and len(data) > 1:
+ _add_kde_trace(fig, plot_data, bins, normalize, column, row, col)
+
+ # Update axes and add annotations
+ fig.update_xaxes(title_text=xlabel, row=row, col=col)
+ ylabel = "Density" if normalize else "Frequency"
+ fig.update_yaxes(title_text=ylabel, row=row, col=col)
+ _add_stats_annotation(fig, data, idx, row, col)
+
+ # Update layout
+ fig.update_layout(
+ title_text="Dataset Feature Distributions",
+ showlegend=False,
+ width=width,
+ height=height,
+ template="plotly_white",
+ )
+
+ return fig
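`_add_kde_trace` rescales the KDE curve when the histogram shows raw counts: the density peaks well below 1, so it is multiplied by a rough per-bin count (`len(data) / bins`) divided by the density maximum. A small sketch of that rescaling in isolation, with synthetic data (seed and sizes are arbitrary choices, not from this PR):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=500)

# Fit a Gaussian KDE and evaluate it on an even grid over the data range
kde = stats.gaussian_kde(sample)
x_range = np.linspace(sample.min(), sample.max(), 100)
kde_values = kde(x_range)

# Rescale the density so its peak matches a rough per-bin count,
# mirroring the non-normalized branch of _add_kde_trace
bins = 30
hist_max = len(sample) / bins
scaled = kde_values * hist_max / kde_values.max()
```

After rescaling, the curve's maximum equals `hist_max` exactly, which keeps the overlay visually comparable to count-based histogram bars.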
diff --git a/validmind/tests/plots/ViolinPlot.py b/validmind/tests/plots/ViolinPlot.py
new file mode 100644
index 000000000..c05215a79
--- /dev/null
+++ b/validmind/tests/plots/ViolinPlot.py
@@ -0,0 +1,125 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import List, Optional
+
+import plotly.express as px
+import plotly.graph_objects as go
+
+from validmind import tags, tasks
+from validmind.errors import SkipTestError
+from validmind.vm_models import VMDataset
+
+
+@tags("tabular_data", "visualization", "distribution")
+@tasks("classification", "regression", "clustering")
+def ViolinPlot(
+ dataset: VMDataset,
+ columns: Optional[List[str]] = None,
+ group_by: Optional[str] = None,
+ width: int = 800,
+ height: int = 600,
+) -> go.Figure:
+ """
+ Generates interactive violin plots for numerical features using Plotly.
+
+ ### Purpose
+
+ This test creates violin plots to visualize the distribution of numerical features,
+ showing both the probability density and summary statistics. Violin plots combine
+ aspects of box plots and kernel density estimation for rich distribution visualization.
+
+ ### Test Mechanism
+
+ The test creates violin plots for specified numerical columns, with optional
+ grouping by categorical variables. Each violin shows the distribution shape,
+ quartiles, and median values.
+
+ ### Signs of High Risk
+
+ - Multimodal distributions that might indicate mixed populations
+ - Highly skewed distributions suggesting data quality issues
+ - Large differences in distribution shapes across groups
+ - Unusual distribution patterns that contradict domain expectations
+
+ ### Strengths
+
+ - Shows detailed distribution shape information
+ - Interactive Plotly visualization with hover details
+ - Effective for comparing distributions across groups
+ - Combines density estimation with quartile information
+
+ ### Limitations
+
+ - Limited to numerical features only
+ - Requires sufficient data points for meaningful density estimation
+ - May not be suitable for discrete variables
+ - Can be misleading with very small sample sizes
+ """
+ # Get numerical columns
+ if columns is None:
+ columns = dataset.feature_columns_numeric
+ else:
+ available_columns = set(dataset.feature_columns_numeric)
+ columns = [col for col in columns if col in available_columns]
+
+ if not columns:
+ raise SkipTestError("No numerical columns found for violin plot")
+
+ # For violin plots, we'll melt the data to long format
+ data = dataset.df[columns].dropna()
+
+ if len(data) == 0:
+ raise SkipTestError("No valid data available for violin plot")
+
+ # Melt the dataframe to long format
+ melted_data = data.melt(var_name="Feature", value_name="Value")
+
+ # Add group column if specified
+ if group_by and group_by in dataset.df.columns:
+ # Repeat group values for each feature
+ group_values = []
+ for column in columns:
+ column_data = dataset.df[[column, group_by]].dropna()
+ group_values.extend(column_data[group_by].tolist())
+
+ if len(group_values) == len(melted_data):
+ melted_data["Group"] = group_values
+ else:
+ group_by = None # Disable grouping if lengths don't match
+
+ # Create violin plot
+ if group_by and "Group" in melted_data.columns:
+ fig = px.violin(
+ melted_data,
+ x="Feature",
+ y="Value",
+ color="Group",
+ box=True,
+ title=f"Distribution of Features by {group_by}",
+ width=width,
+ height=height,
+ )
+ else:
+ fig = px.violin(
+ melted_data,
+ x="Feature",
+ y="Value",
+ box=True,
+ title="Feature Distributions",
+ width=width,
+ height=height,
+ )
+
+ # Update layout
+ fig.update_layout(
+ template="plotly_white",
+ title_x=0.5,
+ xaxis_title="Features",
+ yaxis_title="Values",
+ )
+
+ # Rotate x-axis labels for better readability
+ fig.update_xaxes(tickangle=45)
+
+ return fig
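The violin plot reshapes the wide feature frame to long format with `melt` so Plotly Express can put one violin per feature on a shared axis. A tiny sketch of that reshape with a two-column toy frame (column names are hypothetical):

```python
import pandas as pd

# Two numeric feature columns in wide format
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# melt stacks the columns into (Feature, Value) pairs, one row per cell
melted = df.melt(var_name="Feature", value_name="Value")
```

Two columns of two rows each become four long-format rows, which is also why the grouping branch has to check that the repeated group values line up with `len(melted_data)` before attaching them.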
diff --git a/validmind/tests/plots/__init__.py b/validmind/tests/plots/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/validmind/tests/stats/CorrelationAnalysis.py b/validmind/tests/stats/CorrelationAnalysis.py
new file mode 100644
index 000000000..d9ae5f8ce
--- /dev/null
+++ b/validmind/tests/stats/CorrelationAnalysis.py
@@ -0,0 +1,251 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import Any, Dict, List, Optional
+
+import numpy as np
+import pandas as pd
+from scipy import stats
+
+from validmind import tags, tasks
+from validmind.errors import SkipTestError
+from validmind.utils import format_records
+from validmind.vm_models import VMDataset
+
+
+def _validate_and_prepare_data(dataset: VMDataset, columns: Optional[List[str]]):
+ """Validate inputs and prepare data for correlation analysis."""
+ if columns is None:
+ columns = dataset.feature_columns_numeric
+ else:
+ available_columns = set(dataset.feature_columns_numeric)
+ columns = [col for col in columns if col in available_columns]
+
+ if not columns:
+ raise SkipTestError("No numerical columns found for correlation analysis")
+
+ if len(columns) < 2:
+ raise SkipTestError(
+ "At least 2 numerical columns required for correlation analysis"
+ )
+
+ # Get data and remove constant columns
+ data = dataset.df[columns].dropna()
+ data = data.loc[:, data.var() != 0]
+
+ if data.shape[1] < 2:
+ raise SkipTestError(
+ "Insufficient non-constant columns for correlation analysis"
+ )
+
+ return data
+
+
+def _compute_correlation_matrices(data, method: str):
+ """Compute correlation and p-value matrices based on method."""
+ if method == "pearson":
+ return _compute_pearson_with_pvalues(data)
+ elif method == "spearman":
+ return _compute_spearman_with_pvalues(data)
+ elif method == "kendall":
+ return _compute_kendall_with_pvalues(data)
+ else:
+ raise ValueError(f"Unsupported correlation method: {method}")
+
+
+def _create_correlation_pairs(
+ corr_matrix, p_matrix, significance_level: float, min_correlation: float
+):
+ """Create correlation pairs table."""
+ correlation_pairs = []
+
+ for i, col1 in enumerate(corr_matrix.columns):
+ for j, col2 in enumerate(corr_matrix.columns):
+ if i < j: # Only upper triangle to avoid duplicates
+ corr_val = corr_matrix.iloc[i, j]
+ p_val = p_matrix.iloc[i, j]
+
+ if abs(corr_val) >= min_correlation:
+ pair_info = {
+ "Feature 1": col1,
+ "Feature 2": col2,
+ "Correlation": corr_val,
+ "Abs Correlation": abs(corr_val),
+ "p-value": p_val,
+ "Significant": "Yes" if p_val < significance_level else "No",
+ "Strength": _correlation_strength(abs(corr_val)),
+ "Direction": "Positive" if corr_val > 0 else "Negative",
+ }
+ correlation_pairs.append(pair_info)
+
+ # Sort by absolute correlation value
+ correlation_pairs.sort(key=lambda x: x["Abs Correlation"], reverse=True)
+ return correlation_pairs
+
+
+def _create_summary_statistics(corr_matrix, correlation_pairs):
+ """Create summary statistics table."""
+ all_correlations = []
+ for i in range(len(corr_matrix.columns)):
+ for j in range(i + 1, len(corr_matrix.columns)):
+ all_correlations.append(abs(corr_matrix.iloc[i, j]))
+
+ significant_count = sum(
+ 1 for pair in correlation_pairs if pair["Significant"] == "Yes"
+ )
+ high_corr_count = sum(
+ 1 for pair in correlation_pairs if pair["Abs Correlation"] > 0.7
+ )
+ very_high_corr_count = sum(
+ 1 for pair in correlation_pairs if pair["Abs Correlation"] > 0.9
+ )
+
+ return {
+ "Total Feature Pairs": len(all_correlations),
+ "Pairs Above Threshold": len(correlation_pairs),
+ "Significant Correlations": significant_count,
+ "High Correlations (>0.7)": high_corr_count,
+ "Very High Correlations (>0.9)": very_high_corr_count,
+ "Mean Absolute Correlation": np.mean(all_correlations),
+ "Max Absolute Correlation": np.max(all_correlations),
+ "Median Absolute Correlation": np.median(all_correlations),
+ }
+
+
+@tags("tabular_data", "statistics", "correlation")
+@tasks("classification", "regression", "clustering")
+def CorrelationAnalysis(
+ dataset: VMDataset,
+ columns: Optional[List[str]] = None,
+ method: str = "pearson",
+ significance_level: float = 0.05,
+ min_correlation: float = 0.1,
+) -> Dict[str, Any]:
+ """
+ Performs comprehensive correlation analysis with significance testing for numerical features.
+
+ ### Purpose
+
+ This test conducts detailed correlation analysis between numerical features, including
+ correlation coefficients, significance testing, and identification of significant
+ relationships. It helps identify multicollinearity, feature relationships, and
+ potential redundancies in the dataset.
+
+ ### Test Mechanism
+
+ The test computes correlation coefficients using the specified method and performs
+ statistical significance testing for each correlation pair. It provides:
+ - Correlation matrix with significance indicators
+ - List of significant correlations above threshold
+ - Summary statistics about correlation patterns
+ - Identification of highly correlated feature pairs
+
+ ### Signs of High Risk
+
+ - Very high correlations (>0.9) indicating potential multicollinearity
+ - Many significant correlations suggesting complex feature interactions
+ - Features with no significant correlations to others (potential isolation)
+ - Unexpected correlation patterns contradicting domain knowledge
+
+ ### Strengths
+
+ - Provides statistical significance testing for correlations
+ - Supports multiple correlation methods (Pearson, Spearman, Kendall)
+ - Identifies potentially problematic high correlations
+ - Filters results by minimum correlation threshold
+ - Comprehensive summary of correlation patterns
+
+ ### Limitations
+
+ - Limited to numerical features only
+ - Cannot detect non-linear relationships (except with Spearman)
+ - Significance testing assumes certain distributional properties
+ - Correlation does not imply causation
+ """
+ # Validate and prepare data
+ data = _validate_and_prepare_data(dataset, columns)
+
+ # Compute correlation matrices
+ corr_matrix, p_matrix = _compute_correlation_matrices(data, method)
+
+ # Create correlation pairs
+ correlation_pairs = _create_correlation_pairs(
+ corr_matrix, p_matrix, significance_level, min_correlation
+ )
+
+ # Build results
+ results = {}
+ if correlation_pairs:
+ results["Correlation Pairs"] = format_records(pd.DataFrame(correlation_pairs))
+
+ # Create summary statistics
+ summary_stats = _create_summary_statistics(corr_matrix, correlation_pairs)
+ results["Summary Statistics"] = format_records(pd.DataFrame([summary_stats]))
+
+ return results
+
+
+def _compute_pearson_with_pvalues(data):
+ """Compute Pearson correlation with p-values"""
+ n_vars = data.shape[1]
+ corr_matrix = data.corr(method="pearson")
+ p_matrix = pd.DataFrame(
+ np.zeros((n_vars, n_vars)), index=corr_matrix.index, columns=corr_matrix.columns
+ )
+
+ for i, col1 in enumerate(data.columns):
+ for j, col2 in enumerate(data.columns):
+ if i != j:
+ _, p_val = stats.pearsonr(data[col1], data[col2])
+ p_matrix.iloc[i, j] = p_val
+
+ return corr_matrix, p_matrix
+
+
+def _compute_spearman_with_pvalues(data):
+ """Compute Spearman correlation with p-values"""
+ n_vars = data.shape[1]
+ corr_matrix = data.corr(method="spearman")
+ p_matrix = pd.DataFrame(
+ np.zeros((n_vars, n_vars)), index=corr_matrix.index, columns=corr_matrix.columns
+ )
+
+ for i, col1 in enumerate(data.columns):
+ for j, col2 in enumerate(data.columns):
+ if i != j:
+ _, p_val = stats.spearmanr(data[col1], data[col2])
+ p_matrix.iloc[i, j] = p_val
+
+ return corr_matrix, p_matrix
+
+
+def _compute_kendall_with_pvalues(data):
+ """Compute Kendall correlation with p-values"""
+ n_vars = data.shape[1]
+ corr_matrix = data.corr(method="kendall")
+ p_matrix = pd.DataFrame(
+ np.zeros((n_vars, n_vars)), index=corr_matrix.index, columns=corr_matrix.columns
+ )
+
+ for i, col1 in enumerate(data.columns):
+ for j, col2 in enumerate(data.columns):
+ if i != j:
+ _, p_val = stats.kendalltau(data[col1], data[col2])
+ p_matrix.iloc[i, j] = p_val
+
+ return corr_matrix, p_matrix
+
+
+def _correlation_strength(abs_corr):
+ """Classify correlation strength"""
+ if abs_corr >= 0.9:
+ return "Very Strong"
+ elif abs_corr >= 0.7:
+ return "Strong"
+ elif abs_corr >= 0.5:
+ return "Moderate"
+ elif abs_corr >= 0.3:
+ return "Weak"
+ else:
+ return "Very Weak"
diff --git a/validmind/tests/stats/DescriptiveStats.py b/validmind/tests/stats/DescriptiveStats.py
new file mode 100644
index 000000000..a36e61536
--- /dev/null
+++ b/validmind/tests/stats/DescriptiveStats.py
@@ -0,0 +1,197 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import Any, Dict, List, Optional
+
+import numpy as np
+import pandas as pd
+from scipy import stats
+
+from validmind import tags, tasks
+from validmind.errors import SkipTestError
+from validmind.utils import format_records
+from validmind.vm_models import VMDataset
+
+
+def _validate_columns(dataset: VMDataset, columns: Optional[List[str]]):
+ """Validate and return numerical columns (excluding boolean columns)."""
+ if columns is None:
+ # Get all columns marked as numeric
+ numeric_columns = dataset.feature_columns_numeric
+ else:
+ available_columns = set(dataset.feature_columns_numeric)
+ numeric_columns = [col for col in columns if col in available_columns]
+
+ # Filter out boolean columns as they can't have proper statistical measures computed
+ columns = []
+ for col in numeric_columns:
+ dtype = dataset.df[col].dtype
+ # Only include integer and float types, exclude boolean
+ if pd.api.types.is_integer_dtype(dtype) or pd.api.types.is_float_dtype(dtype):
+ columns.append(col)
+
+ if not columns:
+ raise SkipTestError(
+ "No numerical columns (integer/float) found for descriptive statistics"
+ )
+
+ return columns
+
+
+def _compute_basic_stats(column: str, data, total_count: int):
+ """Compute basic statistics for a column."""
+ return {
+ "Feature": column,
+ "Count": len(data),
+ "Missing": total_count - len(data),
+ "Missing %": ((total_count - len(data)) / total_count) * 100,
+ "Mean": data.mean(),
+ "Median": data.median(),
+ "Std": data.std(),
+ "Min": data.min(),
+ "Max": data.max(),
+ "Q1": data.quantile(0.25),
+ "Q3": data.quantile(0.75),
+ "IQR": data.quantile(0.75) - data.quantile(0.25),
+ }
+
+
+def _compute_advanced_stats(column: str, data, confidence_level: float):
+ """Compute advanced statistics for a column."""
+ try:
+ # Distribution measures
+ skewness = stats.skew(data)
+ kurtosis_val = stats.kurtosis(data)
+ cv = (data.std() / data.mean()) * 100 if data.mean() != 0 else np.nan
+
+ # Confidence interval for mean
+ ci_lower, ci_upper = stats.t.interval(
+ confidence_level,
+ len(data) - 1,
+ loc=data.mean(),
+ scale=data.std() / np.sqrt(len(data)),
+ )
+
+ # Normality test
+ if len(data) <= 5000:
+ normality_stat, normality_p = stats.shapiro(data)
+ normality_test = "Shapiro-Wilk"
+ else:
+ ad_result = stats.anderson(data, dist="norm")
+ normality_stat = ad_result.statistic
+ normality_p = 0.05 if normality_stat > ad_result.critical_values[2] else 0.1
+ normality_test = "Anderson-Darling"
+
+ # Outlier detection using IQR method
+ iqr = data.quantile(0.75) - data.quantile(0.25)
+ lower_bound = data.quantile(0.25) - 1.5 * iqr
+ upper_bound = data.quantile(0.75) + 1.5 * iqr
+ outliers = data[(data < lower_bound) | (data > upper_bound)]
+ outlier_count = len(outliers)
+ outlier_pct = (outlier_count / len(data)) * 100
+
+ return {
+ "Feature": column,
+ "Skewness": skewness,
+ "Kurtosis": kurtosis_val,
+ "CV %": cv,
+ f"CI Lower ({confidence_level*100:.0f}%)": ci_lower,
+ f"CI Upper ({confidence_level*100:.0f}%)": ci_upper,
+ "Normality Test": normality_test,
+ "Normality Stat": normality_stat,
+ "Normality p-value": normality_p,
+ "Normal Distribution": "Yes" if normality_p > 0.05 else "No",
+ "Outliers (IQR)": outlier_count,
+ "Outliers %": outlier_pct,
+ }
+ except Exception:
+ return None
+
+
+@tags("tabular_data", "statistics", "data_quality")
+@tasks("classification", "regression", "clustering")
+def DescriptiveStats(
+ dataset: VMDataset,
+ columns: Optional[List[str]] = None,
+ include_advanced: bool = True,
+ confidence_level: float = 0.95,
+) -> Dict[str, Any]:
+ """
+ Provides comprehensive descriptive statistics for numerical features in a dataset.
+
+ ### Purpose
+
+ This test generates detailed descriptive statistics for numerical features, including
+ basic statistics, distribution measures, confidence intervals, and normality tests.
+ It provides a comprehensive overview of data characteristics essential for
+ understanding data quality and distribution properties.
+
+ ### Test Mechanism
+
+ The test computes various statistical measures for each numerical column:
+ - Basic statistics: count, mean, median, std, min, max, quartiles
+ - Distribution measures: skewness, kurtosis, coefficient of variation
+ - Confidence intervals for the mean
+ - Normality tests (Shapiro-Wilk for small samples, Anderson-Darling for larger)
+ - Missing value analysis
+
+ ### Signs of High Risk
+
+ - High skewness or kurtosis indicating non-normal distributions
+ - Large coefficients of variation suggesting high data variability
+ - Significant results in normality tests when normality is expected
+ - High percentage of missing values
+ - Extreme outliers based on IQR analysis
+
+ ### Strengths
+
+ - Comprehensive statistical analysis in a single test
+ - Includes advanced statistical measures beyond basic descriptives
+ - Provides confidence intervals for uncertainty quantification
+ - Handles missing values appropriately
+ - Suitable for both exploratory and confirmatory analysis
+
+ ### Limitations
+
+ - Limited to numerical features only
+ - Normality tests may not be meaningful for all data types
+ - Large datasets may make some tests computationally expensive
+ - Interpretation requires statistical knowledge
+ """
+ # Validate inputs
+ columns = _validate_columns(dataset, columns)
+
+ # Compute statistics
+ basic_stats = []
+ advanced_stats = []
+
+ for column in columns:
+ data = dataset.df[column].dropna()
+ total_count = len(dataset.df[column])
+
+ if len(data) == 0:
+ continue
+
+ # Basic statistics
+ basic_row = _compute_basic_stats(column, data, total_count)
+ basic_stats.append(basic_row)
+
+ # Advanced statistics
+ if include_advanced and len(data) > 2:
+ advanced_row = _compute_advanced_stats(column, data, confidence_level)
+ if advanced_row is not None:
+ advanced_stats.append(advanced_row)
+
+ # Format results
+ results = {}
+ if basic_stats:
+ results["Basic Statistics"] = format_records(pd.DataFrame(basic_stats))
+
+ if advanced_stats and include_advanced:
+ results["Advanced Statistics"] = format_records(pd.DataFrame(advanced_stats))
+
+ if not results:
+ raise SkipTestError("Unable to compute statistics for any columns")
+
+ return results
diff --git a/validmind/tests/stats/NormalityTests.py b/validmind/tests/stats/NormalityTests.py
new file mode 100644
index 000000000..060aa1cd4
--- /dev/null
+++ b/validmind/tests/stats/NormalityTests.py
@@ -0,0 +1,147 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import Any, Dict, List, Optional
+
+import pandas as pd
+from scipy import stats
+
+from validmind import tags, tasks
+from validmind.errors import SkipTestError
+from validmind.utils import format_records
+from validmind.vm_models import VMDataset
+
+
+def _validate_columns(dataset: VMDataset, columns: Optional[List[str]]):
+ """Validate and return numerical columns."""
+ if columns is None:
+ columns = dataset.feature_columns_numeric
+ else:
+ available_columns = set(dataset.feature_columns_numeric)
+ columns = [col for col in columns if col in available_columns]
+
+ if not columns:
+ raise SkipTestError("No numerical columns found for normality testing")
+
+ return columns
+
+
+def _run_shapiro_test(data, tests: List[str], alpha: float):
+ """Run Shapiro-Wilk test if requested and data size is appropriate."""
+ results = {}
+ if "shapiro" in tests and len(data) <= 5000:
+ try:
+ stat, p_value = stats.shapiro(data)
+ results["Shapiro-Wilk Stat"] = stat
+ results["Shapiro-Wilk p-value"] = p_value
+ results["Shapiro-Wilk Normal"] = "Yes" if p_value > alpha else "No"
+ except Exception:
+ results["Shapiro-Wilk Normal"] = "Test Failed"
+ return results
+
+
+def _run_anderson_test(data, tests: List[str]):
+ """Run Anderson-Darling test if requested."""
+ results = {}
+ if "anderson" in tests:
+ try:
+ ad_result = stats.anderson(data, dist="norm")
+ critical_value = ad_result.critical_values[2] # 5% level
+ results["Anderson-Darling Stat"] = ad_result.statistic
+ results["Anderson-Darling Critical"] = critical_value
+ results["Anderson-Darling Normal"] = (
+ "Yes" if ad_result.statistic < critical_value else "No"
+ )
+ except Exception:
+ results["Anderson-Darling Normal"] = "Test Failed"
+ return results
+
+
+def _run_ks_test(data, tests: List[str], alpha: float):
+ """Run Kolmogorov-Smirnov test if requested."""
+ results = {}
+ if "kstest" in tests:
+ try:
+ standardized = (data - data.mean()) / data.std()
+ stat, p_value = stats.kstest(standardized, "norm")
+ results["KS Test Stat"] = stat
+ results["KS Test p-value"] = p_value
+ results["KS Test Normal"] = "Yes" if p_value > alpha else "No"
+ except Exception:
+ results["KS Test Normal"] = "Test Failed"
+ return results
+
+
+def _process_column_tests(column: str, data, tests: List[str], alpha: float):
+ """Process all normality tests for a single column."""
+ result_row = {"Feature": column, "Sample Size": len(data)}
+
+ # Run individual tests
+ result_row.update(_run_shapiro_test(data, tests, alpha))
+ result_row.update(_run_anderson_test(data, tests))
+ result_row.update(_run_ks_test(data, tests, alpha))
+
+ return result_row
+
+
+@tags("tabular_data", "statistics", "normality")
+@tasks("classification", "regression", "clustering")
+def NormalityTests(
+ dataset: VMDataset,
+ columns: Optional[List[str]] = None,
+ alpha: float = 0.05,
+ tests: List[str] = ["shapiro", "anderson", "kstest"],
+) -> Dict[str, Any]:
+ """
+ Performs multiple normality tests on numerical features to assess distribution normality.
+
+ ### Purpose
+
+ This test evaluates whether numerical features follow a normal distribution using
+ various statistical tests. Understanding distribution normality is crucial for
+ selecting appropriate statistical methods and model assumptions.
+
+ ### Test Mechanism
+
+ The test applies multiple normality tests:
+ - Shapiro-Wilk test: Best for small to medium samples
+ - Anderson-Darling test: More sensitive to deviations in tails
+ - Kolmogorov-Smirnov test: General goodness-of-fit test
+
+ ### Signs of High Risk
+
+ - Multiple normality tests failing consistently
+ - Very low p-values indicating strong evidence against normality
+ - Conflicting results between different normality tests
+
+ ### Strengths
+
+ - Multiple statistical tests for robust assessment
+ - Clear pass/fail indicators for each test
+ - Suitable for different sample sizes
+
+ ### Limitations
+
+ - Limited to numerical features only
+ - Some tests sensitive to sample size
+ - Perfect normality is rare in real data
+ """
+ # Validate inputs
+ columns = _validate_columns(dataset, columns)
+
+ # Process each column
+ normality_results = []
+ for column in columns:
+ data = dataset.df[column].dropna()
+
+ if len(data) >= 3:
+ result_row = _process_column_tests(column, data, tests, alpha)
+ normality_results.append(result_row)
+
+ # Format results
+ results = {}
+ if normality_results:
+ results["Normality Tests"] = format_records(pd.DataFrame(normality_results))
+
+ return results
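The three `_run_*` helpers above each wrap a SciPy routine and reduce it to a pass/fail verdict. A minimal standalone sketch of the same decision logic, using SciPy directly on synthetic data (the sample and `alpha` below are illustrative, not part of the module):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)
alpha = 0.05

# Shapiro-Wilk: p > alpha means we cannot reject normality
shapiro_stat, shapiro_p = stats.shapiro(sample)

# Anderson-Darling: compare the statistic to the 5% critical value
# (critical_values holds the [15, 10, 5, 2.5, 1]% levels, so index 2 is 5%)
ad = stats.anderson(sample, dist="norm")
ad_normal = ad.statistic < ad.critical_values[2]

# Kolmogorov-Smirnov: standardize first, then test against a standard normal
standardized = (sample - sample.mean()) / sample.std()
ks_stat, ks_p = stats.kstest(standardized, "norm")

print(f"Shapiro p={shapiro_p:.3f}, AD normal={ad_normal}, KS p={ks_p:.3f}")
```

One caveat worth knowing: because the KS test here compares against a normal whose mean and standard deviation were estimated from the same data, its nominal p-value is conservative; the Lilliefors correction exists for exactly this situation.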
diff --git a/validmind/tests/stats/OutlierDetection.py b/validmind/tests/stats/OutlierDetection.py
new file mode 100644
index 000000000..48b7c2b6e
--- /dev/null
+++ b/validmind/tests/stats/OutlierDetection.py
@@ -0,0 +1,173 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+from typing import Any, Dict, List, Optional
+
+import numpy as np
+import pandas as pd
+from scipy import stats
+from sklearn.ensemble import IsolationForest
+
+from validmind import tags, tasks
+from validmind.errors import SkipTestError
+from validmind.utils import format_records
+from validmind.vm_models import VMDataset
+
+
+def _validate_columns(dataset: VMDataset, columns: Optional[List[str]]):
+ """Validate and return numerical columns."""
+ if columns is None:
+ columns = dataset.feature_columns_numeric
+ else:
+ available_columns = set(dataset.feature_columns_numeric)
+ columns = [col for col in columns if col in available_columns]
+
+ # Filter out boolean columns as they can't be used for outlier detection
+ numeric_columns = []
+ for col in columns:
+ if col in dataset.df.columns:
+ col_dtype = dataset.df[col].dtype
+ # Exclude boolean and object types, keep only true numeric types
+ if pd.api.types.is_numeric_dtype(col_dtype) and not pd.api.types.is_bool_dtype(
+ col_dtype
+ ):
+ numeric_columns.append(col)
+
+ columns = numeric_columns
+
+ if not columns:
+ raise SkipTestError("No suitable numerical columns found for outlier detection")
+
+ return columns
+
+
+def _detect_iqr_outliers(data, iqr_threshold: float):
+ """Detect outliers using IQR method."""
+ q1, q3 = data.quantile(0.25), data.quantile(0.75)
+ iqr = q3 - q1
+ lower_bound = q1 - iqr_threshold * iqr
+ upper_bound = q3 + iqr_threshold * iqr
+ # Build a boolean mask with pandas element-wise comparisons
+ outlier_mask = (data < lower_bound) | (data > upper_bound)
+ iqr_outliers = data[outlier_mask]
+ return len(iqr_outliers), (len(iqr_outliers) / len(data)) * 100
+
+
+def _detect_zscore_outliers(data, zscore_threshold: float):
+ """Detect outliers using Z-score method."""
+ z_scores = np.abs(stats.zscore(data))
+ # Flag values whose absolute z-score exceeds the threshold
+ outlier_mask = z_scores > zscore_threshold
+ zscore_outliers = data[outlier_mask]
+ return len(zscore_outliers), (len(zscore_outliers) / len(data)) * 100
+
+
+def _detect_isolation_forest_outliers(data, contamination: float):
+ """Detect outliers using Isolation Forest method."""
+ if len(data) <= 10:
+ # Too few observations for a meaningful Isolation Forest fit; report no outliers
+ return 0, 0
+
+ try:
+ iso_forest = IsolationForest(contamination=contamination, random_state=42)
+ outlier_pred = iso_forest.fit_predict(data.values.reshape(-1, 1))
+ iso_outliers = data[outlier_pred == -1]
+ return len(iso_outliers), (len(iso_outliers) / len(data)) * 100
+ except Exception:
+ return 0, 0
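The helper above relies on `IsolationForest.fit_predict` returning -1 for points it considers anomalous and 1 otherwise, with `contamination` controlling roughly what fraction of the training data gets flagged. A minimal sketch on toy data (the values and contamination level are illustrative only):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 95 inliers clustered near 0, plus 5 planted outliers at 10
values = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 10.0)])

iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(values.reshape(-1, 1))  # -1 = outlier, 1 = inlier

n_outliers = int((labels == -1).sum())
outlier_values = values[labels == -1]
```

With `contamination=0.05` the decision threshold is set so that roughly 5% of the training points fall on the anomalous side, and the planted points at 10 score as the most isolated, so they are the ones flagged.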
+
+
+def _process_column_outliers(
+ column: str,
+ data,
+ methods: List[str],
+ iqr_threshold: float,
+ zscore_threshold: float,
+ contamination: float,
+):
+ """Process outlier detection for a single column."""
+ outliers_dict = {"Feature": column, "Total Count": len(data)}
+
+ # IQR method
+ if "iqr" in methods:
+ count, percentage = _detect_iqr_outliers(data, iqr_threshold)
+ outliers_dict["IQR Outliers"] = count
+ outliers_dict["IQR %"] = percentage
+
+ # Z-score method
+ if "zscore" in methods:
+ count, percentage = _detect_zscore_outliers(data, zscore_threshold)
+ outliers_dict["Z-Score Outliers"] = count
+ outliers_dict["Z-Score %"] = percentage
+
+ # Isolation Forest method
+ if "isolation_forest" in methods:
+ count, percentage = _detect_isolation_forest_outliers(data, contamination)
+ outliers_dict["Isolation Forest Outliers"] = count
+ outliers_dict["Isolation Forest %"] = percentage
+
+ return outliers_dict
+
+
+@tags("tabular_data", "statistics", "outliers")
+@tasks("classification", "regression", "clustering")
+def OutlierDetection(
+ dataset: VMDataset,
+ columns: Optional[List[str]] = None,
+ methods: List[str] = ["iqr", "zscore", "isolation_forest"],
+ iqr_threshold: float = 1.5,
+ zscore_threshold: float = 3.0,
+ contamination: float = 0.1,
+) -> Dict[str, Any]:
+ """
+ Detects outliers in numerical features using multiple statistical methods.
+
+ ### Purpose
+
+ This test identifies outliers in numerical features using various statistical
+ methods including IQR, Z-score, and Isolation Forest. It provides comprehensive
+ outlier detection to help identify data quality issues and potential anomalies.
+
+ ### Test Mechanism
+
+ The test applies multiple outlier detection methods:
+ - IQR method: values beyond Q1 - k*IQR or Q3 + k*IQR, where k is iqr_threshold (default 1.5)
+ - Z-score method: values with |z-score| > zscore_threshold (default 3.0)
+ - Isolation Forest: ML-based anomaly detection
+
+ ### Signs of High Risk
+
+ - High percentage of outliers indicating data quality issues
+ - Inconsistent outlier detection across methods
+ - Extreme outliers that significantly deviate from normal patterns
+
+ ### Strengths
+
+ - Multiple detection methods for robust outlier identification
+ - Customizable thresholds for different sensitivity levels
+ - Clear summary of outlier patterns across features
+
+ ### Limitations
+
+ - Limited to numerical features only
+ - Some methods assume normal distributions
+ - Threshold selection can be subjective
+ """
+ # Validate inputs
+ columns = _validate_columns(dataset, columns)
+
+ # Process each column
+ outlier_summary = []
+ for column in columns:
+ data = dataset.df[column].dropna()
+
+ if len(data) >= 3:
+ outliers_dict = _process_column_outliers(
+ column, data, methods, iqr_threshold, zscore_threshold, contamination
+ )
+ outlier_summary.append(outliers_dict)
+
+ # Format results
+ results = {}
+ if outlier_summary:
+ results["Outlier Summary"] = format_records(pd.DataFrame(outlier_summary))
+
+ return results
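The IQR and z-score rules used by `_detect_iqr_outliers` and `_detect_zscore_outliers` can be sketched self-contained on hypothetical data with one planted outlier:

```python
import numpy as np
import pandas as pd
from scipy import stats

# 99 well-behaved values plus one planted outlier
data = pd.Series(list(range(1, 100)) + [500])

# IQR rule (default threshold of 1.5)
q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# Z-score rule (default threshold of 3.0)
z_scores = np.abs(stats.zscore(data))
zscore_outliers = data[z_scores > 3.0]

# Both rules isolate the planted value
print(iqr_outliers.tolist(), zscore_outliers.tolist())
```

One subtlety of the z-score rule: with the population standard deviation (scipy's default, ddof=0), the largest attainable |z| in a sample of size n is (n-1)/sqrt(n), which stays below 3 until n >= 11, so the method can never flag anything in very small samples regardless of how extreme a value is.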
diff --git a/validmind/tests/stats/__init__.py b/validmind/tests/stats/__init__.py
new file mode 100644
index 000000000..e69de29bb