A tool for normalizing data quality metrics into a single measure for columns (richness) and tables (joint richness)


dataculpa-richness

Dataset quality analysis through richness metrics and information-gain based feature selection.


Author: Mitch Haile | Organization: Data Culpa, Inc. | Website: www.dataculpa.com


Quick Start

# Install
pip install dataculpa-richness

# Analyze a dataset
dataculpa-richness data.csv

# Or use the API
python3 -c "
from dataculpa_richness import analyze_dataset_richness
import pandas as pd
df = pd.read_csv('data.csv')
result = analyze_dataset_richness(df)
print('Recommended columns:', result['mixes']['lean']['columns'])
"

Overview

dataculpa-richness provides a comprehensive framework for analyzing dataset quality:

  • Column richness (0-1): Combines fill rate with balanced entropy
  • Schema analysis: Analyzes diversity in dict/JSON structures
  • Dependency detection: Uses normalized mutual information
  • Information-gain selection: Greedy algorithm for optimal column ordering
  • Mutual exclusivity: Finds columns with non-overlapping nulls and similar dependencies
  • Recommendations: Suggests "lean" (efficient) and "max" (comprehensive) column sets

Installation

From PyPI (when published)

pip install dataculpa-richness

From Source

git clone https://github.com/dataculpa/dataculpa-richness.git
cd dataculpa-richness
pip install -e .

Development

pip install -e ".[dev]"

Requirements:

  • Python >= 3.8
  • numpy >= 1.20.0
  • pandas >= 1.3.0
  • scikit-learn >= 1.0.0

Optional: Install Graphviz to render dependency graphs.


Features

1. Column Richness

Quantifies column quality with a single score (0-1):

richness = fill_rate × balanced_entropy

where e is the column's normalized entropy and balanced_entropy = 4e(1-e) peaks at e = 0.5, rewarding columns that are structured but diverse.
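To make the formula concrete, here is a minimal, self-contained sketch. The package's `column_richness` adds sampling and edge-case handling; `toy_richness` is a hypothetical name used only for illustration:

```python
import math
from collections import Counter

def toy_richness(values):
    """Illustrative richness: fill_rate x 4*e*(1-e), where e is the
    Shannon entropy of non-null values normalized by log(k)."""
    non_null = [v for v in values if v is not None]
    fill_rate = len(non_null) / len(values)
    counts = Counter(non_null)
    k = len(counts)
    if k <= 1:
        return 0.0  # a constant (or empty) column carries no diversity
    n = len(non_null)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    e = entropy / math.log(k)  # normalize to [0, 1]
    return fill_rate * 4 * e * (1 - e)

# Two equally frequent values: e = 1.0, so the balanced-entropy
# factor is 0 -- perfectly uniform columns score low.
print(toy_richness(["a", "b", "a", "b"]))  # 0.0
# A 3:1 skew lands nearer the e = 0.5 sweet spot and scores higher.
print(toy_richness(["a", "a", "a", "b"]))
```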

2. Information-Gain Selection

Greedy algorithm that orders columns by incremental value:

  1. Start with the richest column
  2. Iteratively add column with max gain: R_j × (1 - max_dep(j, S))
  3. Produces "lean" (90% of gain) and "max" (all columns) mixes
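The steps above can be sketched directly (`greedy_order` is a hypothetical name; the package's `greedy_info_gain_path` also tracks cumulative gains):

```python
import numpy as np

def greedy_order(richness, dep):
    """Illustrative greedy info-gain ordering: start from the richest
    column, then repeatedly add the column j maximizing
    richness[j] * (1 - max dependency to the already-selected set)."""
    remaining = set(range(len(richness)))
    order = [int(np.argmax(richness))]
    remaining.discard(order[0])
    while remaining:
        best, best_gain = None, -1.0
        for j in remaining:
            gain = richness[j] * (1 - max(dep[j, s] for s in order))
            if gain > best_gain:
                best, best_gain = j, gain
        order.append(best)
        remaining.discard(best)
    return order

richness = np.array([0.9, 0.8, 0.5])
dep = np.array([[1.0, 0.95, 0.1],
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
# Column 1 nearly duplicates column 0, so the less-rich but
# independent column 2 is picked second.
print(greedy_order(richness, dep))  # [0, 2, 1]
```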

3. Mutual Exclusivity Detection (NEW v0.2.0)

Finds column pairs with:

  • Non-overlapping nulls: When one has a value, the other is null
  • Similar dependencies: Both relate to the same columns

Use cases:

  • Schema evolution (old_email → new_email)
  • Mutually exclusive alternatives (home_phone vs mobile_phone)
  • Conditional fields (residential_address vs business_address)
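A toy illustration of the null-overlap signal, using only pandas (the column names and data are illustrative):

```python
import pandas as pd

# Toy frame where old_email and new_email never overlap -- the
# pattern the mutual-exclusivity detector looks for.
df = pd.DataFrame({
    "old_email": ["a@x.com", "b@x.com", None, None],
    "new_email": [None, None, "c@y.com", "d@y.com"],
})

# Fraction of rows where both columns are populated at once.
both_non_null = (df["old_email"].notna() & df["new_email"].notna()).mean()
print(f"Rows with both set: {both_non_null:.0%}")  # 0%
```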

Quick Examples

Analyze a Dataset

from dataculpa_richness import analyze_dataset_richness
import pandas as pd

df = pd.read_csv("data.csv")
result = analyze_dataset_richness(df, row_sample_frac=0.25)

# Get recommended columns
lean_cols = result["mixes"]["lean"]["columns"]
print(f"Lean mix ({len(lean_cols)} columns):", lean_cols)

Column Richness

from dataculpa_richness import column_richness

metrics = column_richness(df["my_column"])
print(f"Richness: {metrics['richness']:.3f}")
print(f"Fill rate: {metrics['fill_rate']:.3f}")

Find Mutually Exclusive Columns

from dataculpa_richness import (
    build_dependency_matrix,
    find_mutually_exclusive_columns,
)

dep_matrix = build_dependency_matrix(df, list(df.columns))
pairs = find_mutually_exclusive_columns(
    df, list(df.columns), dep_matrix,
    max_null_overlap=0.3,
    min_dep_similarity=0.7,
)

for pair in pairs[:5]:
    print(f"{pair['col1']} <-> {pair['col2']}: score={pair['mutual_exclusivity_score']:.3f}")

CLI Usage

Basic Analysis

dataculpa-richness data.csv

Custom Parameters

dataculpa-richness data.parquet \
  --row-frac 0.5 \
  --lean-frac 0.85 \
  --output-prefix my_analysis \
  --dep-min-weight 0.3

Output Files

| File | Description |
| --- | --- |
| `{prefix}_column_profile.csv` | Richness metrics per column |
| `{prefix}_dep_matrix.csv` | Pairwise dependency matrix |
| `{prefix}_info_gain_path.csv` | Greedy column ordering |
| `{prefix}_mixes.json` | Lean/max recommendations |
| `{prefix}_column_deps.dot` | Dependency graph (render with Graphviz) |
| `{prefix}_mutual_exclusive.csv` | Mutually exclusive pairs |
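As a sketch of consuming the mixes output downstream (the exact JSON layout here is an assumption, mirrored from the in-memory `result["mixes"]` structure used by the Python API):

```python
import json

# Hypothetical mixes-file content; the lean/max layout mirrors
# result["mixes"] but is an assumption, not a documented format.
sample = {"lean": {"columns": ["user_id", "email"]},
          "max": {"columns": ["user_id", "email", "phone"]}}
with open("my_analysis_mixes.json", "w") as f:
    json.dump(sample, f)

with open("my_analysis_mixes.json") as f:
    mixes = json.load(f)
print("Lean columns:", mixes["lean"]["columns"])
```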

CLI Options

--row-frac FRAC          Fraction of rows to sample (default: 0.25)
--col-sample-size N      Per-column sample size
--lean-frac FRAC         Lean mix threshold (default: 0.9)
--output-prefix PREFIX   Output file prefix
--dep-min-weight WEIGHT  Min dependency for graph edges (default: 0.3)
--dep-top-k K           Max edges per node in graph (default: 3)
--dep-max-edges N       Max total edges in graph (default: 200)

Python API

Full Analysis Pipeline

from dataculpa_richness import analyze_dataset_richness

result = analyze_dataset_richness(
    df,
    row_sample_frac=0.25,
    col_sample_size=10000,
    lean_fraction=0.9,
)

# Access results
col_profile = result["column_profile"]
dep_matrix = result["dependency_matrix"]
mixes = result["mixes"]

Column-Level Analysis

from dataculpa_richness import column_richness, schema_arrangement_richness

# Single column
metrics = column_richness(df["column_name"])

# Dict/JSON column
schema_metrics = schema_arrangement_richness(df["json_column"])

DataFrame Profile

from dataculpa_richness import dataframe_richness_profile

col_profile, schema_metrics = dataframe_richness_profile(df)
top_cols = col_profile.sort_values("richness", ascending=False).head(10)

Dependency Analysis

from dataculpa_richness import build_dependency_matrix, dependency_graphviz

cols = list(df.columns)
dep_matrix = build_dependency_matrix(df, cols)

# Generate graph
dot_src = dependency_graphviz(cols, dep_matrix, min_weight=0.3)
with open("deps.dot", "w") as f:
    f.write(dot_src)
# Render: dot -Tpng deps.dot -o deps.png

Custom Column Selection

from dataculpa_richness import (
    greedy_info_gain_path,
    choose_core_mixes_from_info_gain,
)
import numpy as np

richness_vec = np.array([col_profile.loc[c, "richness"] for c in cols])
path = greedy_info_gain_path(richness_vec, dep_matrix)
mixes = choose_core_mixes_from_info_gain(path, cols, lean_fraction=0.9)

print("Lean mix:", mixes["lean"]["columns"])

Conditional Richness

from dataculpa_richness import rank_columns_given_prior

# What columns to add given you already have user_id?
ranked = rank_columns_given_prior(
    cols=list(df.columns),
    col_profile=col_profile,
    dep_matrix=dep_matrix,
    prior_col_name="user_id",
)

for rec in ranked[:5]:
    print(f"{rec['column']}: {rec['conditional_richness']:.3f}")

Mutual Exclusivity Detection

from dataculpa_richness import find_mutually_exclusive_columns, analyze_null_overlap

# Find mutually exclusive pairs
pairs = find_mutually_exclusive_columns(
    df,
    cols=list(df.columns),
    dep_matrix=dep_matrix,
    max_null_overlap=0.3,      # Max 30% rows with both non-null
    min_dep_similarity=0.7,     # Min 70% dependency similarity
    top_k=10,
)

for pair in pairs:
    print(f"{pair['col1']} <-> {pair['col2']}")
    print(f"  Score: {pair['mutual_exclusivity_score']:.3f}")
    print(f"  Null overlap: {pair['null_overlap']:.1%}")

# Detailed analysis
detail = analyze_null_overlap(df, "col1", "col2")
print(f"Both non-null: {detail['pct_both_non_null']:.1f}%")

Use Cases

Feature Selection for ML

result = analyze_dataset_richness(df)
features = result["mixes"]["lean"]["columns"]
X_train = df[features]

Data Quality Assessment

col_profile, _ = dataframe_richness_profile(df)
low_quality = col_profile[col_profile["richness"] < 0.1]
print("Low quality columns:", low_quality.index.tolist())

Schema Evolution Detection

pairs = find_mutually_exclusive_columns(df, cols, dep_matrix)
for pair in pairs:
    if pair['mutual_exclusivity_score'] > 0.8:
        print(f"⚠️ Possible migration: {pair['col1']} → {pair['col2']}")

Schema Analysis for JSON Columns

col_profile, schema_metrics = dataframe_richness_profile(df)
for col, metrics in schema_metrics.items():
    print(f"{col}: {metrics['unique_schema_arrangements']} unique schemas")

API Reference

Core Functions

column_richness(series, sample_size=None, random_state=42)

  • Compute richness for a single column
  • Returns: Dict with richness, fill_rate, entropy_norm, entropy_balanced, etc.

schema_arrangement_richness(series, sample_size=None, random_state=42)

  • Richness over dict/JSON schema arrangements
  • Returns: Dict with schema statistics and top arrangements

dataframe_richness_profile(df, sample_size=None, random_state=42, detect_schema_cols=True)

  • Profile all columns in a DataFrame
  • Returns: (col_profile DataFrame, schema_metrics dict)

analyze_dataset_richness(df, row_sample_frac=0.25, col_sample_size=None, random_state=42, lean_fraction=0.9)

  • Full analysis pipeline
  • Returns: Dict with column_profile, dependency_matrix, mixes, etc.

Dependency Analysis

normalized_mi(x, y)

  • Normalized mutual information between two series (0-1)
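For intuition, scikit-learn's `normalized_mutual_info_score` computes the same quantity for discrete labels (a conceptual stand-in, not necessarily the package's own implementation):

```python
from sklearn.metrics import normalized_mutual_info_score

# Two series that determine each other exactly yield NMI = 1.0;
# independent series would yield a value near 0.
x = [0, 0, 1, 1, 2, 2]
y = ["a", "a", "b", "b", "c", "c"]
print(normalized_mutual_info_score(x, y))  # 1.0
```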

build_dependency_matrix(df, cols)

  • Build pairwise dependency matrix using normalized MI
  • Returns: numpy array (n_cols × n_cols)

dependency_graphviz(cols, dep_matrix, min_weight=0.2, top_k_per_node=3, max_edges=200, graph_name="ColumnDependencies")

  • Generate Graphviz DOT string for dependency graph

Information Gain & Selection

greedy_info_gain_path(richness_vec, dep_matrix)

  • Greedy ordering of columns by information gain
  • Returns: List of dicts with cumulative gains

choose_core_mixes_from_info_gain(path, cols, lean_fraction=0.9)

  • Extract lean and max mixes from greedy path
  • Returns: Dict with lean, max, and path DataFrame

rank_columns_given_prior(cols, col_profile, dep_matrix, prior_col_name)

  • Rank columns by conditional richness given a prior column
  • Returns: List of dicts sorted by conditional richness

Mutual Exclusivity Detection

find_mutually_exclusive_columns(df, cols, dep_matrix, max_null_overlap=0.3, min_dep_similarity=0.7, top_k=20)

  • Find columns with non-overlapping nulls and similar dependencies
  • Returns: List of dicts with exclusivity scores

analyze_null_overlap(df, col1, col2)

  • Detailed null overlap analysis between two columns
  • Returns: Dict with counts, percentages, and overlap metrics

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Make your changes
  4. Run tests and linters: pytest, black ., flake8
  5. Commit: git commit -m "Add feature"
  6. Push and create a Pull Request

Development Setup

git clone https://github.com/dataculpa/dataculpa-richness.git
cd dataculpa-richness
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"

Code Style

  • Follow PEP 8
  • Use Black for formatting (line length: 88)
  • Add type hints where appropriate
  • Write docstrings for public functions

License

Apache License 2.0 - See LICENSE file for details.


Citation

@software{dataculpa_richness,
  title = {dataculpa-richness: Dataset Quality Analysis and Feature Selection},
  author = {Haile, Mitch},
  organization = {Data Culpa, Inc.},
  year = {2025},
  url = {https://www.dataculpa.com}
}

Version 0.2.0 | Built with ❤️ by Data Culpa, Inc.
