Add comprehensive statistical testing utilities (#17)
- Add t_test, mann_whitney_u_test for comparing samples
- Add chi_square_test for goodness of fit and independence
- Add kolmogorov_smirnov_test for distribution testing
- Add bootstrap_confidence_interval for robust CI estimation
- Add permutation_test for non-parametric testing
- Add multiple_comparison_correction for multiple testing
- Add power_analysis and sample_size_calculation for study design
- Include comprehensive error handling and validation
- Add StatisticalTest dataclass for structured results
- Update __init__.py to export statistical functions
- Add comprehensive test suite for all statistical functions
Pull Request Overview
This PR adds comprehensive statistical testing utilities to the skdr-eval library to enhance its analytical capabilities for offline policy evaluation. The changes include a wide range of statistical tests and utilities commonly needed in experimental design and hypothesis testing.
- Implementation of parametric (t-test) and non-parametric (Mann-Whitney U) tests for sample comparison
- Addition of goodness-of-fit tests (chi-square, Kolmogorov-Smirnov) for distribution testing
- Bootstrap and permutation testing methods for robust statistical inference
- Power analysis and sample size calculation utilities for experimental planning
- Multiple comparison correction methods and comprehensive error handling
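Of the utilities listed, the bootstrap is the easiest to illustrate compactly. The sketch below is a minimal percentile bootstrap using only the standard library; the function name and signature are hypothetical and the PR's actual `bootstrap_confidence_interval` may differ (e.g. in resampling count, CI method, or RNG handling):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI sketch; the PR's real API may differ."""
    rng = random.Random(seed)
    # Resample with replacement and recompute the statistic each time
    reps = sorted(stat([rng.choice(data) for _ in data]) for _ in range(n_boot))
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci(list(range(100)))
```

The percentile method is the simplest bootstrap CI; libraries often offer bias-corrected (BCa) variants, which this sketch does not attempt.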
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/skdr_eval/statistical.py | Main implementation of statistical functions with comprehensive error handling and structured results |
| src/skdr_eval/__init__.py | Export statistical functions and exception classes to the public API |
| tests/test_statistical.py | Comprehensive test suite covering all statistical functions with edge cases and error conditions |
```python
se = np.sqrt(np.var(sample1, ddof=1) / len(sample1) + np.var(sample2, ddof=1) / len(sample2))
if equal_var:
    se = se * np.sqrt(1 / len(sample1) + 1 / len(sample2))
```
The standard error calculation is incorrect when `equal_var=True`. Line 95 modifies the standard error computed on line 93, but for an equal-variance t-test the pooled standard error must be computed from the pooled variance formula.
Suggested change:

```python
# before
se = np.sqrt(np.var(sample1, ddof=1) / len(sample1) + np.var(sample2, ddof=1) / len(sample2))
if equal_var:
    se = se * np.sqrt(1 / len(sample1) + 1 / len(sample2))

# after
n1, n2 = len(sample1), len(sample2)
var1, var2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)
if equal_var:
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
else:
    se = np.sqrt(var1 / n1 + var2 / n2)
```
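One way to sanity-check the suggested fix without extra dependencies: the pooled and Welch standard errors coincide exactly when the two samples have equal sizes, and generally differ otherwise. A pure-stdlib sketch mirroring the corrected logic (the helper name is hypothetical, not the PR's API):

```python
import math
import statistics

def standard_error(sample1, sample2, equal_var=True):
    # Mirrors the corrected standard-error logic from the suggestion
    n1, n2 = len(sample1), len(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    if equal_var:
        pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
        return math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return math.sqrt(var1 / n1 + var2 / n2)

a = [1.0, 2.0, 3.5, 4.0, 6.0]
b = [2.0, 2.5, 3.0, 5.5, 7.0]
# Equal sample sizes: pooled and Welch standard errors agree
assert math.isclose(standard_error(a, b, True), standard_error(a, b, False))
```

The original buggy version instead multiplies the Welch SE by an extra `sqrt(1/n1 + 1/n2)` factor, which fails this equal-size check.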
```python
# Calculate effect size (Cramér's V)
n = np.sum(observed)
effect_size = np.sqrt(chi2_stat / (n * (min(observed.shape) - 1)))
```
The Cramér's V calculation is incorrect for 1D arrays. `observed.shape` for a 1D array returns a tuple like `(5,)`, so `min(observed.shape)` returns the length of the array, not the number of dimensions. For goodness-of-fit tests with 1D data, the formula should use `len(observed) - 1` directly.
Suggested change:

```python
# before
effect_size = np.sqrt(chi2_stat / (n * (min(observed.shape) - 1)))
# after
effect_size = np.sqrt(chi2_stat / (n * (len(observed) - 1)))
```
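For reference, the corrected formula is easy to check by hand on a small goodness-of-fit table. The sketch below uses only the standard library; the helper name is hypothetical and just mirrors the suggested `len(observed) - 1` denominator:

```python
import math

def gof_effect_size(observed, expected):
    # Chi-square goodness-of-fit statistic with a Cramér's-V-style
    # effect size; for 1D data the denominator uses len(observed) - 1
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    n = sum(observed)
    return math.sqrt(chi2 / (n * (len(observed) - 1)))

# observed [10, 20, 30] vs uniform expected [20, 20, 20]:
# chi2 = 100/20 + 0 + 100/20 = 10, n = 60, k - 1 = 2
v = gof_effect_size([10, 20, 30], [20, 20, 20])
assert math.isclose(v, math.sqrt(10 / 120))
```

When observed counts exactly match the expected ones, the statistic and the effect size are both zero, which is a useful trivial test case.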
```python
"assess_propensity_calibration",
"assess_propensity_discrimination",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
```python
"check_propensity_balance",
"check_propensity_overlap",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
```python
"comprehensive_propensity_diagnostics",
"compute_balance_statistics",
"compute_propensity_log_loss",
"compute_propensity_statistics",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
```python
"compute_propensity_statistics",
"dr_value_with_clip",
"evaluate_pairwise_models",
"evaluate_propensity_diagnostics",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
```python
"evaluate_sklearn_models",
"fit_outcome_crossfit",
"fit_propensity_timecal",
"generate_propensity_report",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
```python
"PropensityDiagnostics",
"BootstrapError",
"ConfigurationError",
"ConvergenceError",
"DataValidationError",
"EstimationError",
"InsufficientDataError",
"MemoryError",
"ModelValidationError",
"OutcomeModelError",
"PairwiseEvaluationError",
"PolicyInductionError",
"PropensityScoreError",
"SkdrEvalError",
"VersionError",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
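A cheap guard against this whole class of bug is a test that walks `__all__` and confirms every listed name actually resolves on the module. A minimal sketch (the helper name is hypothetical; it works on any module that defines `__all__`):

```python
import importlib

def missing_exports(module_name):
    # Return the names listed in __all__ that the module
    # does not actually define in its namespace
    mod = importlib.import_module(module_name)
    return [name for name in getattr(mod, "__all__", []) if not hasattr(mod, name)]

# e.g. missing_exports("skdr_eval") should come back empty once the
# stale entries are removed or the corresponding imports are added
```

Running this in the test suite catches any drift between `__all__` and the actual imports before users hit it via `from skdr_eval import ...`.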
Pull Request
📋 Description
Type of Change
🔗 Related Issues
🧪 Testing
Test Coverage
Manual Testing
- make check

Test commands run:

```shell
# List the commands you used to test
make check
python examples/quickstart.py
```

📝 Changes Made
Code Changes
API Changes
API changes:
📚 Documentation
✅ Checklist
Code Quality
- `ruff check` and `ruff format`
- `mypy` type checking

Testing & CI
Documentation & Communication
🔍 Review Notes
Focus Areas
Questions for Reviewers
📸 Screenshots/Examples
🚀 Deployment Notes
Additional Context: