Add Imbalance Detection & Visualization Utility #23

@ShrutiPatel263

🎯 Problem

DsKit currently has apply_smote() in imbalance.py to handle imbalanced data, but there's no function to detect if data is imbalanced in the first place.
Users need to manually check class distributions before knowing whether to apply SMOTE or other techniques.

💡 Proposed Solution

Add a detect_imbalance() function to the existing dskit/imbalance.py module that:

- Analyzes class distribution
- Calculates imbalance severity
- Visualizes the distribution
- Recommends handling strategies

📦 What I'll Add

Function: detect_imbalance()
    from dskit.imbalance import detect_imbalance

    result = detect_imbalance(df, 'target_column', visualize=True)

Output:

- Bar plot showing class distribution

- Imbalance severity score (0-1)

- Recommended strategies based on severity

Features:

- Calculate class distribution and ratios
- Compute imbalance severity score
- Generate visualization (bar plot with percentages)
- Recommend strategies: class weights, SMOTE, undersampling, etc.
- Support for both DataFrame and array inputs
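
A minimal sketch of the first feature (class distribution and ratios), using only the standard library. The helper names `class_distribution`, `imbalance_ratio`, and `class_percentages` are hypothetical and not part of DsKit; the ratio follows the majority:minority convention used in this issue.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def class_distribution(labels: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Count how many samples fall into each class."""
    return dict(Counter(labels))


def imbalance_ratio(distribution: Dict[Hashable, int]) -> float:
    """Majority count divided by minority count (1.0 means perfectly balanced)."""
    if not distribution:
        raise ValueError("no labels provided")
    counts = distribution.values()
    return max(counts) / min(counts)


def class_percentages(distribution: Dict[Hashable, int]) -> Dict[Hashable, float]:
    """Share of each class as a percentage of all samples."""
    total = sum(distribution.values())
    return {cls: 100.0 * n / total for cls, n in distribution.items()}


# Hypothetical data: 900 samples of one class, 100 of the other.
labels = ["class_0"] * 900 + ["class_1"] * 100
dist = class_distribution(labels)
print(dist)                     # {'class_0': 900, 'class_1': 100}
print(imbalance_ratio(dist))    # 9.0
print(class_percentages(dist))  # {'class_0': 90.0, 'class_1': 10.0}
```

Because the helpers accept any iterable of labels, the same code should work for a DataFrame column or a NumPy array passed in as the label sequence.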

Example Return:
    {
        'is_imbalanced': True,
        'severity': 0.75,  # 0-1 scale
        'class_distribution': {'class_0': 900, 'class_1': 100},
        'imbalance_ratio': 9.0,  # majority:minority
        'recommendations': [
            'Use SMOTE for oversampling minority class',
            'Apply class weights in model training',
            'Consider threshold tuning'
        ]
    }
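
This issue does not define how `severity` is computed or how it maps to recommendations, so the following is only one possible sketch. It assumes severity is 0 when the smallest class holds its fair share (1/k of the samples, for k classes) and approaches 1 as that class shrinks. Under this assumed formula the 900/100 example scores 0.8, not the illustrative 0.75 shown above. The function names and the 0.2 and 0.5 cut-offs are hypothetical.

```python
from typing import Dict, Hashable, List


def severity_score(distribution: Dict[Hashable, int]) -> float:
    """Assumed definition: 1 - k * (minority share), where k is the class count."""
    k = len(distribution)
    if k < 2:
        raise ValueError("need at least two classes")
    total = sum(distribution.values())
    minority_share = min(distribution.values()) / total
    return 1.0 - minority_share * k


def recommend_strategies(severity: float) -> List[str]:
    """Map severity to strategies; the cut-offs are assumptions for illustration."""
    if severity < 0.2:
        return []  # close enough to balanced, nothing to recommend
    recommendations = ["Apply class weights in model training"]
    if severity >= 0.5:
        recommendations.append("Use SMOTE for oversampling minority class")
        recommendations.append("Consider threshold tuning")
    return recommendations


dist = {"class_0": 900, "class_1": 100}
score = severity_score(dist)
print(round(score, 2))  # 0.8
print(recommend_strategies(score))
```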

🔧 Implementation Details

File to modify: dskit/imbalance.py (existing file)
Function signature:
    from typing import Any, Dict, Union

    import numpy as np
    import pandas as pd


    def detect_imbalance(
        data: Union[pd.DataFrame, np.ndarray],
        target: Union[str, np.ndarray],
        threshold: float = 0.3,
        visualize: bool = True,
        recommend: bool = True,
    ) -> Dict[str, Any]:
        """
        Detect and analyze class imbalance in classification data.

        Parameters
        ----------
        data : DataFrame or array-like
            Dataset containing features and target
        target : str or array-like
            Target column name (if data is DataFrame) or target array
        threshold : float, default=0.3
            Imbalance threshold (e.g., 0.3 means a 30:70 split triggers a warning)
        visualize : bool, default=True
            Generate visualization of class distribution
        recommend : bool, default=True
            Provide handling strategy recommendations

        Returns
        -------
        dict
            Contains is_imbalanced, severity, class_distribution,
            imbalance_ratio, recommendations
        """
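
A rough sketch of how the `target` and `threshold` parameters could be interpreted. It assumes that a string `target` is looked up as a column of `data`, and that the data is flagged when the smallest class's share is at or below `threshold` (so a 30:70 split is flagged at the default). Both helper names are hypothetical, and the boundary behavior is an assumption rather than something this issue specifies.

```python
from collections import Counter
from typing import Any, List, Sequence, Union


def resolve_labels(data: Any, target: Union[str, Sequence[Any]]) -> List[Any]:
    """Return the label values whether `target` names a column or is the labels."""
    if isinstance(target, str):
        # Column lookup by name works for a pandas DataFrame
        # and equally for a plain dict of lists.
        return list(data[target])
    return list(target)


def is_imbalanced(labels: Sequence[Any], threshold: float = 0.3) -> bool:
    """Assumed rule: flag when the minority class share is <= threshold."""
    if len(labels) == 0:
        raise ValueError("no labels provided")
    counts = Counter(labels)
    minority_share = min(counts.values()) / len(labels)
    return minority_share <= threshold


# Hypothetical inputs: a dict standing in for a DataFrame, and a bare label list.
table = {"feature": list(range(100)), "target": [0] * 90 + [1] * 10}
print(is_imbalanced(resolve_labels(table, "target")))  # True
print(is_imbalanced(resolve_labels(None, [0, 1] * 50)))  # False
```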

📊 Visualization

The function will generate a bar plot showing:

- Class counts with percentages
- Visual indication of imbalance severity
- Color coding (green=balanced, yellow=moderate, red=severe)
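
One way the plot described above might be put together, as a sketch only. The severity cut-offs (0.3 and 0.6) and all function names are assumptions; `"gold"` stands in for yellow because it is easier to read on a white background. matplotlib is imported inside the plotting function so the two helpers can be used without it, and `Axes.bar_label` requires matplotlib 3.4 or newer.

```python
from typing import Dict, Hashable, List


def severity_color(severity: float) -> str:
    """Map a 0-1 severity score to a bar color (cut-offs are assumed)."""
    if severity < 0.3:
        return "green"  # balanced
    if severity < 0.6:
        return "gold"   # moderate (yellow)
    return "red"        # severe


def bar_labels(distribution: Dict[Hashable, int]) -> List[str]:
    """Build 'count (percent%)' labels, one per class, in dict order."""
    total = sum(distribution.values())
    return [f"{n} ({100 * n / total:.1f}%)" for n in distribution.values()]


def plot_distribution(distribution: Dict[Hashable, int], severity: float):
    """Draw a bar plot of class counts colored by severity and return the figure."""
    import matplotlib.pyplot as plt  # lazy import: only needed when plotting

    fig, ax = plt.subplots()
    bars = ax.bar(
        [str(cls) for cls in distribution],
        list(distribution.values()),
        color=severity_color(severity),
    )
    ax.bar_label(bars, labels=bar_labels(distribution))
    ax.set_xlabel("Class")
    ax.set_ylabel("Count")
    ax.set_title(f"Class distribution (severity {severity:.2f})")
    return fig


dist = {"class_0": 900, "class_1": 100}
print(bar_labels(dist))      # ['900 (90.0%)', '100 (10.0%)']
print(severity_color(0.75))  # red
```

Returning the figure instead of calling `plt.show()` leaves it to the caller to display or save the plot.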

💪 Why This Matters

Current workflow:

    # Users do this manually
    print(df['target'].value_counts())
    # Then decide if they need SMOTE

After this feature:

    # One line detects and recommends
    result = detect_imbalance(df, 'target')
    if result['is_imbalanced']:
        X_bal, y_bal = apply_smote(X, y)  # existing function

Impact: Completes the imbalance handling workflow (detect → handle)
