# dataculpa-richness

Dataset quality analysis through richness metrics and information-gain based feature selection.

Author: Mitch Haile | Organization: Data Culpa, Inc. | Website: [www.dataculpa.com](https://www.dataculpa.com)
## Quick Start

```bash
# Install
pip install dataculpa-richness

# Analyze a dataset
dataculpa-richness data.csv

# Or use the API
python3 -c "
from dataculpa_richness import analyze_dataset_richness
import pandas as pd
df = pd.read_csv('data.csv')
result = analyze_dataset_richness(df)
print('Recommended columns:', result['mixes']['lean']['columns'])
"
```

## Table of Contents

- Overview
- Installation
- Features
- Quick Examples
- CLI Usage
- Python API
- Use Cases
- API Reference
- Contributing
- License
## Overview

dataculpa-richness provides a comprehensive framework for analyzing dataset quality:
- Column richness (0-1): Combines fill rate with balanced entropy
- Schema analysis: Analyzes diversity in dict/JSON structures
- Dependency detection: Uses normalized mutual information
- Information-gain selection: Greedy algorithm for optimal column ordering
- Mutual exclusivity: Finds columns with non-overlapping nulls and similar dependencies
- Recommendations: Suggests "lean" (efficient) and "max" (comprehensive) column sets
## Installation

### From PyPI

```bash
pip install dataculpa-richness
```

### From source

```bash
git clone https://github.com/dataculpa/dataculpa-richness.git
cd dataculpa-richness
pip install -e .
```

### With development dependencies

```bash
pip install -e ".[dev]"
```

### Requirements

- Python >= 3.8
- numpy >= 1.20.0
- pandas >= 1.3.0
- scikit-learn >= 1.0.0

Optional: Install Graphviz to render dependency graphs.
## Features

### Column richness

Quantifies column quality with a single score (0-1):

```
richness = fill_rate × balanced_entropy
```

where `balanced_entropy = 4e(1-e)` peaks at e = 0.5 (structured but diverse), so both constant columns (e = 0) and fully unique columns (e = 1) score low.

### Information-gain selection

Greedy algorithm that orders columns by incremental value:

- Start with the richest column
- Iteratively add the column with maximum gain: `R_j × (1 - max_dep(j, S))`
- Produces "lean" (90% of total gain) and "max" (all columns) mixes
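A minimal sketch of these two ideas — the richness score and the greedy info-gain ordering. `richness_score` and `greedy_path` are illustrative names for this sketch, not the library's API:

```python
import numpy as np
import pandas as pd

def richness_score(series: pd.Series) -> float:
    """Illustrative richness: fill_rate x balanced entropy (sketch, not library code)."""
    fill_rate = series.notna().mean()
    counts = series.dropna().value_counts(normalize=True)
    if len(counts) <= 1:
        return 0.0  # constant or empty columns carry no diversity
    # Normalized Shannon entropy e in [0, 1]
    e = -(counts * np.log2(counts)).sum() / np.log2(len(counts))
    balanced = 4 * e * (1 - e)  # peaks at e = 0.5, zero at e = 0 and e = 1
    return float(fill_rate * balanced)

def greedy_path(richness, dep):
    """Greedy ordering: repeatedly add the column maximizing R_j * (1 - max dep(j, S))."""
    selected, remaining = [], set(range(len(richness)))
    while remaining:
        best = max(
            remaining,
            key=lambda j: richness[j]
            * (1 - max((dep[j][k] for k in selected), default=0.0)),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

s = pd.Series(["a", "a", "a", "b", None, "a", "a", "c"])
print(f"richness = {richness_score(s):.3f}")

# Column 1 nearly duplicates column 0, so the greedy path defers it
dep = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(greedy_path([0.9, 0.8, 0.5], dep))  # [0, 2, 1]
```

Note how the dependency penalty demotes column 1: its raw richness (0.8) exceeds column 2's (0.5), but its 0.9 dependency on the already-selected column 0 leaves little incremental gain.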
### Mutual exclusivity detection

Finds column pairs with:

- Non-overlapping nulls: when one has a value, the other is null
- Similar dependencies: both relate to the same columns

Use cases:

- Schema evolution (`old_email` → `new_email`)
- Mutually exclusive alternatives (`home_phone` vs `mobile_phone`)
- Conditional fields (`residential_address` vs `business_address`)
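The null-overlap half of this check can be sketched in a few lines; `null_overlap` here is an illustrative helper, not part of the package:

```python
import pandas as pd

def null_overlap(df: pd.DataFrame, col1: str, col2: str) -> float:
    """Fraction of rows where BOTH columns are non-null.
    A value near 0 suggests the columns are mutually exclusive alternatives."""
    return float((df[col1].notna() & df[col2].notna()).mean())

# A schema-evolution pattern: the old column empties as the new one fills in
df = pd.DataFrame({
    "old_email": ["a@x.com", "b@x.com", None, None],
    "new_email": [None, None, "c@y.com", "d@y.com"],
})
print(null_overlap(df, "old_email", "new_email"))  # 0.0
```

The library additionally requires similar dependency profiles before flagging a pair, which filters out columns that merely happen to have disjoint nulls.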
## Quick Examples

### Full analysis

```python
from dataculpa_richness import analyze_dataset_richness
import pandas as pd

df = pd.read_csv("data.csv")
result = analyze_dataset_richness(df, row_sample_frac=0.25)

# Get recommended columns
lean_cols = result["mixes"]["lean"]["columns"]
print(f"Lean mix ({len(lean_cols)} columns):", lean_cols)
```

### Single-column richness

```python
from dataculpa_richness import column_richness

metrics = column_richness(df["my_column"])
print(f"Richness: {metrics['richness']:.3f}")
print(f"Fill rate: {metrics['fill_rate']:.3f}")
```

### Mutually exclusive columns

```python
from dataculpa_richness import (
    build_dependency_matrix,
    find_mutually_exclusive_columns,
)

dep_matrix = build_dependency_matrix(df, list(df.columns))
pairs = find_mutually_exclusive_columns(
    df, list(df.columns), dep_matrix,
    max_null_overlap=0.3,
    min_dep_similarity=0.7,
)
for pair in pairs[:5]:
    print(f"{pair['col1']} <-> {pair['col2']}: score={pair['mutual_exclusivity_score']:.3f}")
```

## CLI Usage

### Basic

```bash
dataculpa-richness data.csv
```

### With options

```bash
dataculpa-richness data.parquet \
    --row-frac 0.5 \
    --lean-frac 0.85 \
    --output-prefix my_analysis \
    --dep-min-weight 0.3
```

### Output files

| File | Description |
|---|---|
| `{prefix}_column_profile.csv` | Richness metrics per column |
| `{prefix}_dep_matrix.csv` | Pairwise dependency matrix |
| `{prefix}_info_gain_path.csv` | Greedy column ordering |
| `{prefix}_mixes.json` | Lean/max recommendations |
| `{prefix}_column_deps.dot` | Dependency graph (render with Graphviz) |
| `{prefix}_mutual_exclusive.csv` | Mutually exclusive pairs |
### Options

```
--row-frac FRAC          Fraction of rows to sample (default: 0.25)
--col-sample-size N      Per-column sample size
--lean-frac FRAC         Lean mix threshold (default: 0.9)
--output-prefix PREFIX   Output file prefix
--dep-min-weight WEIGHT  Min dependency for graph edges (default: 0.3)
--dep-top-k K            Max edges per node in graph (default: 3)
--dep-max-edges N        Max total edges in graph (default: 200)
```
## Python API

### Full pipeline

```python
from dataculpa_richness import analyze_dataset_richness

result = analyze_dataset_richness(
    df,
    row_sample_frac=0.25,
    col_sample_size=10000,
    lean_fraction=0.9,
)

# Access results
col_profile = result["column_profile"]
dep_matrix = result["dependency_matrix"]
mixes = result["mixes"]
```

### Column-level metrics

```python
from dataculpa_richness import column_richness, schema_arrangement_richness

# Single column
metrics = column_richness(df["column_name"])

# Dict/JSON column
schema_metrics = schema_arrangement_richness(df["json_column"])
```

### DataFrame profile

```python
from dataculpa_richness import dataframe_richness_profile

col_profile, schema_metrics = dataframe_richness_profile(df)
top_cols = col_profile.sort_values("richness", ascending=False).head(10)
```

### Dependency analysis

```python
from dataculpa_richness import build_dependency_matrix, dependency_graphviz

cols = list(df.columns)
dep_matrix = build_dependency_matrix(df, cols)

# Generate graph
dot_src = dependency_graphviz(cols, dep_matrix, min_weight=0.3)
with open("deps.dot", "w") as f:
    f.write(dot_src)
# Render: dot -Tpng deps.dot -o deps.png
```

### Information-gain selection

```python
from dataculpa_richness import (
    greedy_info_gain_path,
    choose_core_mixes_from_info_gain,
)
import numpy as np

richness_vec = np.array([col_profile.loc[c, "richness"] for c in cols])
path = greedy_info_gain_path(richness_vec, dep_matrix)
mixes = choose_core_mixes_from_info_gain(path, cols, lean_fraction=0.9)
print("Lean mix:", mixes["lean"]["columns"])
```

### Conditional ranking

```python
from dataculpa_richness import rank_columns_given_prior

# What columns to add given you already have user_id?
ranked = rank_columns_given_prior(
    cols=list(df.columns),
    col_profile=col_profile,
    dep_matrix=dep_matrix,
    prior_col_name="user_id",
)
for rec in ranked[:5]:
    print(f"{rec['column']}: {rec['conditional_richness']:.3f}")
```

### Mutual exclusivity

```python
from dataculpa_richness import find_mutually_exclusive_columns, analyze_null_overlap

# Find mutually exclusive pairs
pairs = find_mutually_exclusive_columns(
    df,
    cols=list(df.columns),
    dep_matrix=dep_matrix,
    max_null_overlap=0.3,    # Max 30% of rows with both non-null
    min_dep_similarity=0.7,  # Min 70% dependency similarity
    top_k=10,
)
for pair in pairs:
    print(f"{pair['col1']} <-> {pair['col2']}")
    print(f"  Score: {pair['mutual_exclusivity_score']:.3f}")
    print(f"  Null overlap: {pair['null_overlap']:.1%}")

# Detailed analysis
detail = analyze_null_overlap(df, "col1", "col2")
print(f"Both non-null: {detail['pct_both_non_null']:.1f}%")
```

## Use Cases

### ML feature selection

```python
result = analyze_dataset_richness(df)
features = result["mixes"]["lean"]["columns"]
X_train = df[features]
```

### Data quality auditing

```python
col_profile, _ = dataframe_richness_profile(df)
low_quality = col_profile[col_profile["richness"] < 0.1]
print("Low quality columns:", low_quality.index.tolist())
```

### Schema migration detection

```python
pairs = find_mutually_exclusive_columns(df, cols, dep_matrix)
for pair in pairs:
    if pair['mutual_exclusivity_score'] > 0.8:
        print(f"⚠️ Possible migration: {pair['col1']} → {pair['col2']}")
```

### JSON schema diversity

```python
col_profile, schema_metrics = dataframe_richness_profile(df)
for col, metrics in schema_metrics.items():
    print(f"{col}: {metrics['unique_schema_arrangements']} unique schemas")
```

## API Reference

### `column_richness(series, sample_size=None, random_state=42)`

- Compute richness for a single column
- Returns: dict with `richness`, `fill_rate`, `entropy_norm`, `entropy_balanced`, etc.
### `schema_arrangement_richness(series, sample_size=None, random_state=42)`

- Richness over dict/JSON schema arrangements
- Returns: dict with schema statistics and top arrangements

### `dataframe_richness_profile(df, sample_size=None, random_state=42, detect_schema_cols=True)`

- Profile all columns in a DataFrame
- Returns: `(col_profile DataFrame, schema_metrics dict)`

### `analyze_dataset_richness(df, row_sample_frac=0.25, col_sample_size=None, random_state=42, lean_fraction=0.9)`

- Full analysis pipeline
- Returns: dict with `column_profile`, `dependency_matrix`, `mixes`, etc.

### `normalized_mi(x, y)`

- Normalized mutual information between two series (0-1)
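A plausible sketch of such a 0-1 dependency score using scikit-learn's `normalized_mutual_info_score`; this is an assumption about the general approach, not the package's exact implementation:

```python
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

def nmi_sketch(x: pd.Series, y: pd.Series) -> float:
    """Treat the two columns as label assignments and compute normalized
    mutual information over rows where both are non-null. Sketch only."""
    pair = pd.concat([x, y], axis=1).dropna()
    if pair.empty:
        return 0.0
    return float(normalized_mutual_info_score(
        pair.iloc[:, 0].astype(str), pair.iloc[:, 1].astype(str)
    ))

x = pd.Series(["a", "a", "b", "b"])
y = pd.Series(["u", "u", "v", "v"])   # fully determined by x
z = pd.Series(["u", "v", "u", "v"])   # independent of x
print(nmi_sketch(x, y))  # 1.0
print(nmi_sketch(x, z))  # 0.0
```

Casting to `str` lets mixed-type columns be compared as discrete labels; the normalization keeps scores comparable across columns with very different cardinalities.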
### `build_dependency_matrix(df, cols)`

- Build pairwise dependency matrix using normalized MI
- Returns: numpy array (n_cols × n_cols)

### `dependency_graphviz(cols, dep_matrix, min_weight=0.2, top_k_per_node=3, max_edges=200, graph_name="ColumnDependencies")`

- Generate Graphviz DOT string for dependency graph

### `greedy_info_gain_path(richness_vec, dep_matrix)`

- Greedy ordering of columns by information gain
- Returns: list of dicts with cumulative gains

### `choose_core_mixes_from_info_gain(path, cols, lean_fraction=0.9)`

- Extract lean and max mixes from greedy path
- Returns: dict with `lean`, `max`, and `path` DataFrame

### `rank_columns_given_prior(cols, col_profile, dep_matrix, prior_col_name)`

- Rank columns by conditional richness given a prior column
- Returns: list of dicts sorted by conditional richness

### `find_mutually_exclusive_columns(df, cols, dep_matrix, max_null_overlap=0.3, min_dep_similarity=0.7, top_k=20)`

- Find columns with non-overlapping nulls and similar dependencies
- Returns: list of dicts with exclusivity scores

### `analyze_null_overlap(df, col1, col2)`

- Detailed null overlap analysis between two columns
- Returns: dict with counts, percentages, and overlap metrics
## Contributing

Contributions welcome! Please:

- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Make your changes
- Run tests and linters: `pytest`, `black .`, `flake8`
- Commit: `git commit -m "Add feature"`
- Push and create a Pull Request

### Development setup

```bash
git clone https://github.com/dataculpa/dataculpa-richness.git
cd dataculpa-richness
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
```

### Code style

- Follow PEP 8
- Use Black for formatting (line length: 88)
- Add type hints where appropriate
- Write docstrings for public functions
## License

Apache License 2.0 - See LICENSE file for details.

## Contact

- Website: www.dataculpa.com
- GitHub: github.com/dataculpa/dataculpa-richness
- Issues: GitHub Issues

## Citation

```bibtex
@software{dataculpa_richness,
  title        = {dataculpa-richness: Dataset Quality Analysis and Feature Selection},
  author       = {Haile, Mitch},
  organization = {Data Culpa, Inc.},
  year         = {2025},
  url          = {https://www.dataculpa.com}
}
```

---

Version 0.2.0 | Built with ❤️ by Data Culpa, Inc.