Add comprehensive statistical testing utilities (#17)
- Add t_test, mann_whitney_u_test for comparing samples
- Add chi_square_test for goodness of fit and independence
- Add kolmogorov_smirnov_test for distribution testing
- Add bootstrap_confidence_interval for robust CI estimation
- Add permutation_test for non-parametric testing
- Add multiple_comparison_correction for multiple testing
- Add power_analysis and sample_size_calculation for study design
- Include comprehensive error handling and validation
- Add StatisticalTest dataclass for structured results
- Update __init__.py to export statistical functions
- Add comprehensive test suite for all statistical functions
Pull Request Overview
This PR adds comprehensive statistical testing utilities to the skdr-eval library to enhance its analytical capabilities for offline policy evaluation. The changes include a wide range of statistical tests and utilities commonly needed in experimental design and hypothesis testing.
- Implementation of parametric (t-test) and non-parametric (Mann-Whitney U) tests for sample comparison
- Addition of goodness-of-fit tests (chi-square, Kolmogorov-Smirnov) for distribution testing
- Bootstrap and permutation testing methods for robust statistical inference
- Power analysis and sample size calculation utilities for experimental planning
- Multiple comparison correction methods and comprehensive error handling
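Of the utilities listed, the bootstrap is the easiest to illustrate compactly. The sketch below is a minimal percentile bootstrap using only the standard library; the function name and signature are hypothetical and the PR's actual `bootstrap_confidence_interval` may differ (e.g. in resampling count, CI method, or RNG handling):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI sketch; the PR's real API may differ."""
    rng = random.Random(seed)
    # Resample with replacement and recompute the statistic each time
    reps = sorted(stat([rng.choice(data) for _ in data]) for _ in range(n_boot))
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci(list(range(100)))
```

The percentile method is the simplest bootstrap CI; libraries often offer bias-corrected (BCa) variants, which this sketch does not attempt.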
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/skdr_eval/statistical.py | Main implementation of statistical functions with comprehensive error handling and structured results |
| src/skdr_eval/__init__.py | Export statistical functions and exception classes to the public API |
| tests/test_statistical.py | Comprehensive test suite covering all statistical functions with edge cases and error conditions |
```python
se = np.sqrt(np.var(sample1, ddof=1) / len(sample1) + np.var(sample2, ddof=1) / len(sample2))
if equal_var:
    se = se * np.sqrt(1 / len(sample1) + 1 / len(sample2))
```
The standard error calculation is incorrect when `equal_var=True`. Line 95 modifies the standard error computed on line 93, but for an equal-variance t-test the pooled standard error must be computed from the pooled variance formula.
Suggested change:

```python
# before
se = np.sqrt(np.var(sample1, ddof=1) / len(sample1) + np.var(sample2, ddof=1) / len(sample2))
if equal_var:
    se = se * np.sqrt(1 / len(sample1) + 1 / len(sample2))

# after
n1, n2 = len(sample1), len(sample2)
var1, var2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)
if equal_var:
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
else:
    se = np.sqrt(var1 / n1 + var2 / n2)
```
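One way to sanity-check the suggested fix without extra dependencies: the pooled and Welch standard errors coincide exactly when the two samples have equal sizes, and generally differ otherwise. A pure-stdlib sketch mirroring the corrected logic (the helper name is hypothetical, not the PR's API):

```python
import math
import statistics

def standard_error(sample1, sample2, equal_var=True):
    # Mirrors the corrected standard-error logic from the suggestion
    n1, n2 = len(sample1), len(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    if equal_var:
        pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
        return math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return math.sqrt(var1 / n1 + var2 / n2)

a = [1.0, 2.0, 3.5, 4.0, 6.0]
b = [2.0, 2.5, 3.0, 5.5, 7.0]
# Equal sample sizes: pooled and Welch standard errors agree
assert math.isclose(standard_error(a, b, True), standard_error(a, b, False))
```

The original buggy version instead multiplies the Welch SE by an extra `sqrt(1/n1 + 1/n2)` factor, which fails this equal-size check.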
```python
# Calculate effect size (Cramér's V)
n = np.sum(observed)
effect_size = np.sqrt(chi2_stat / (n * (min(observed.shape) - 1)))
```
The Cramér's V calculation is incorrect for 1D arrays. `observed.shape` for a 1D array returns a tuple like `(5,)`, so `min(observed.shape)` returns the length of the array, not the number of dimensions. For goodness-of-fit tests with 1D data, the formula should use `len(observed) - 1` directly.
Suggested change:

```python
# before
effect_size = np.sqrt(chi2_stat / (n * (min(observed.shape) - 1)))
# after
effect_size = np.sqrt(chi2_stat / (n * (len(observed) - 1)))
```
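For reference, the corrected formula is easy to check by hand on a small goodness-of-fit table. The sketch below uses only the standard library; the helper name is hypothetical and just mirrors the suggested `len(observed) - 1` denominator:

```python
import math

def gof_effect_size(observed, expected):
    # Chi-square goodness-of-fit statistic with a Cramér's-V-style
    # effect size; for 1D data the denominator uses len(observed) - 1
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    n = sum(observed)
    return math.sqrt(chi2 / (n * (len(observed) - 1)))

# observed [10, 20, 30] vs uniform expected [20, 20, 20]:
# chi2 = 100/20 + 0 + 100/20 = 10, n = 60, k - 1 = 2
v = gof_effect_size([10, 20, 30], [20, 20, 20])
assert math.isclose(v, math.sqrt(10 / 120))
```

When observed counts exactly match the expected ones, the statistic and the effect size are both zero, which is a useful trivial test case.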
```python
"assess_propensity_calibration",
"assess_propensity_discrimination",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
```python
"check_propensity_balance",
"check_propensity_overlap",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
```python
"comprehensive_propensity_diagnostics",
"compute_balance_statistics",
"compute_propensity_log_loss",
"compute_propensity_statistics",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
```python
"compute_propensity_statistics",
"dr_value_with_clip",
"evaluate_pairwise_models",
"evaluate_propensity_diagnostics",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
```python
"evaluate_sklearn_models",
"fit_outcome_crossfit",
"fit_propensity_timecal",
"generate_propensity_report",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
```python
"PropensityDiagnostics",
"BootstrapError",
"ConfigurationError",
"ConvergenceError",
"DataValidationError",
"EstimationError",
"InsufficientDataError",
"MemoryError",
"ModelValidationError",
"OutcomeModelError",
"PairwiseEvaluationError",
"PolicyInductionError",
"PropensityScoreError",
"SkdrEvalError",
"VersionError",
```
The `__all__` list includes many functions and classes that are not imported in the file. These exports will cause `ImportError` when users try to import them, as they are not available in the module namespace.
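A cheap guard against this whole class of bug is a test that walks `__all__` and confirms every listed name actually resolves on the module. A minimal sketch (the helper name is hypothetical; it works on any module that defines `__all__`):

```python
import importlib

def missing_exports(module_name):
    # Return the names listed in __all__ that the module
    # does not actually define in its namespace
    mod = importlib.import_module(module_name)
    return [name for name in getattr(mod, "__all__", []) if not hasattr(mod, name)]

# e.g. missing_exports("skdr_eval") should come back empty once the
# stale entries are removed or the corresponding imports are added
```

Running this in the test suite catches any drift between `__all__` and the actual imports before users hit it via `from skdr_eval import ...`.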
Pull Request
📋 Description
Type of Change
🔗 Related Issues
🧪 Testing
Test Coverage
Manual Testing
- make check

Test commands run:

```shell
# List the commands you used to test
make check
python examples/quickstart.py
```

📝 Changes Made
Code Changes
API Changes
API changes:
📚 Documentation
✅ Checklist
Code Quality
- `ruff check` and `ruff format`
- `mypy` type checking

Testing & CI
Documentation & Communication
🔍 Review Notes
Focus Areas
Questions for Reviewers
📸 Screenshots/Examples
🚀 Deployment Notes
Additional Context: