
Add Statistical Significance for Subgroup Discovery #57

Open
Rilanja wants to merge 25 commits into flemmerich:master from Rilanja:nullDistribution


Rilanja commented Apr 1, 2025

This PR introduces a framework for assessing statistical significance of discovered subgroups via permutation testing, implemented through composition to preserve original functionality.

Key Changes:

  1. StatisticalSignificance Class

    • Performs permutation tests to generate a null distribution
    • Integrates a progress bar
    • Supports multiple testing corrections (Bonferroni, Holm, ...)
    • Implements both parametric (normal, gumbel_r) and non-parametric (empirical) p-value estimation
    • Handles different target types (binary, numeric, frequent itemsets)
  2. Stats

    • Wrapper for existing search strategies
    • Parallel permutation testing implementation
  3. SignificantSubgroupResult

    • Enhanced result class with p-values (the column header includes the method name) and adjusted p-values
    • Adds statistical metrics to the result dataframe
  4. Visualization

    • plot_null_distribution() function for analysis:
      • Histogram + KDE of null distribution
      • Observed quality reference line
      • Optional comparison with theoretical distributions (Gumbel/Normal)
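
For readers unfamiliar with the approach, the permutation test and the two kinds of p-value listed above can be sketched as follows. This is a minimal standard-library illustration, not the code in this PR: the quality function `mean_shift_quality`, the helper names, and the toy data are all assumptions, and it scores a single fixed subgroup rather than wrapping a search strategy.

```python
import random
from statistics import NormalDist, fmean, stdev


def mean_shift_quality(targets, in_subgroup):
    """Toy quality (assumption): sqrt(n) * (subgroup mean - overall mean)."""
    members = [t for t, m in zip(targets, in_subgroup) if m]
    if not members:
        return 0.0
    return len(members) ** 0.5 * (fmean(members) - fmean(targets))


def null_distribution(targets, in_subgroup, n_permutations=500, seed=0):
    """Recompute the quality after repeatedly shuffling the target column."""
    rng = random.Random(seed)
    shuffled = list(targets)
    null = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between target and membership
        null.append(mean_shift_quality(shuffled, in_subgroup))
    return null


def empirical_p_value(observed, null):
    """Non-parametric: share of null qualities >= observed, with +1 smoothing."""
    exceed = sum(q >= observed for q in null)
    return (exceed + 1) / (len(null) + 1)


def normal_p_value(observed, null):
    """Parametric: upper tail of a normal fitted to the null qualities."""
    fitted = NormalDist(fmean(null), stdev(null))
    return 1.0 - fitted.cdf(observed)


# Toy data: the first 30 of 200 rows form the subgroup and have a shifted mean.
data_rng = random.Random(42)
in_subgroup = [i < 30 for i in range(200)]
targets = [data_rng.gauss(1.0 if m else 0.0, 1.0) for m in in_subgroup]

observed = mean_shift_quality(targets, in_subgroup)
null = null_distribution(targets, in_subgroup)
p_emp = empirical_p_value(observed, null)
p_norm = normal_p_value(observed, null)
print(f"observed={observed:.3f} empirical p={p_emp:.4f} normal p={p_norm:.4f}")
```

The empirical p-value can never drop below 1 / (n_permutations + 1), which is one reason a fitted distribution can be useful when only a limited number of permutations is affordable.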

Technical Notes:

  • Required monkey-patching of SelectorBase serialization methods to enable parallel processing
  • New dependencies:
    • joblib (parallel processing)
    • statsmodels (multiple testing correction; normal and Gumbel distributions)
    • tqdm and tqdm_joblib (progress bar and patch for parallel processing)
    • seaborn (plotting of histogram and KDE)
  • Install all with:
pip install joblib statsmodels tqdm tqdm_joblib seaborn
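
As background on the serialization note above: worker processes receive their inputs via pickling, so objects that cannot be pickled block parallel execution. The sketch below shows the general monkey-patching technique on a hypothetical `ToySelector` class. It is not pysubgroup's `SelectorBase`, and the methods actually patched in this PR may differ.

```python
import pickle


class ToySelector:
    """Hypothetical stand-in for a selector; NOT pysubgroup's SelectorBase."""

    def __init__(self, attribute, value):
        self.attribute = attribute
        self.value = value
        # A lambda stored on the instance cannot be pickled by default.
        self.covers = lambda row: row[attribute] == value


# Before patching, pickling fails because of the lambda attribute.
try:
    pickle.dumps(ToySelector("age", 42))
except (pickle.PicklingError, AttributeError, TypeError) as exc:
    print("unpatched:", type(exc).__name__)


def _getstate(self):
    """Serialize only the plain data needed to rebuild the object."""
    return {"attribute": self.attribute, "value": self.value}


def _setstate(self, state):
    """Rebuild the instance (including the lambda) from the plain data."""
    self.__init__(state["attribute"], state["value"])


# Monkey-patch: attach the methods to the existing class from outside.
ToySelector.__getstate__ = _getstate
ToySelector.__setstate__ = _setstate

original = ToySelector("age", 42)
clone = pickle.loads(pickle.dumps(original))
print(clone.attribute, clone.value, clone.covers({"age": 42}))
```

Patching from outside leaves the class definition untouched, which fits the composition-based design described at the top of this PR, at the cost of behaviour that is only active once the patching module has been imported.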

Usage Example:

  • Added metrics.ipynb demonstrating:
    • Significance testing workflow
    • Multiple comparison corrections
    • Null distribution visualization
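
The notebook is not reproduced here. As a rough illustration of the multiple-comparison step, the Bonferroni and Holm adjustments named above can be written in a few lines of plain Python. The function names and the example p-values are assumptions for illustration; the dependency list suggests the PR delegates this to statsmodels, which provides `statsmodels.stats.multitest.multipletests` for these corrections.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment, returned in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is scaled by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]  # made-up p-values for four subgroups
bonf = bonferroni(raw)
holm_adj = holm(raw)
for p, b, h in zip(raw, bonf, holm_adj):
    print(f"raw={p:.2f} bonferroni={b:.2f} holm={h:.2f}")
```

Holm is never more conservative than Bonferroni: every Holm-adjusted value is less than or equal to its Bonferroni counterpart, while both control the family-wise error rate.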

Rilanja added 25 commits March 16, 2025 11:42
…on and the presentation in the SubgroupDiscoveryResult, and a wrapper for search function for flexibility.
…solution (I found) was to extract all components of a SubgroupDiscoveryTask object and recreate it independently in a worker function. You have to manually define all possible parameters (e.g. constraint or value to ignore in 'create_selectors'). Maybe not the best solution, but parallelization works so far.
…te values (-inf, inf) back that messed up my normality test. Placed filter for this.
… that come with it. No way to extract the ignore attribute, constraint, ...
…mal or gumbel distribution in visualization. Small corrections on docstrings.