🎯 Problem
DsKit currently has apply_smote() in imbalance.py to handle imbalanced data, but there's no function to detect if data is imbalanced in the first place.
Users need to manually check class distributions before knowing whether to apply SMOTE or other techniques.
💡 Proposed Solution
Add a detect_imbalance() function to the existing dskit/imbalance.py module that:
- Analyzes class distribution
- Calculates imbalance severity
- Visualizes the distribution
- Recommends handling strategies
📦 What I'll Add
Function: detect_imbalance()
```python
from dskit.imbalance import detect_imbalance

result = detect_imbalance(df, 'target_column', visualize=True)
```
Output:
- Bar plot showing class distribution
- Imbalance severity score (0-1)
- Recommended strategies based on severity
Features:
- Calculate class distribution and ratios
- Compute imbalance severity score
- Generate visualization (bar plot with percentages)
- Recommend strategies: class weights, SMOTE, undersampling, etc.
- Support for both DataFrame and array inputs
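The distribution and ratio features could be computed along the following lines. This is a minimal sketch, not DsKit's implementation: `class_counts` and `imbalance_ratio` are hypothetical helper names, and the sketch assumes that a string `target` names a DataFrame column while anything else is treated as an array of labels.

```python
from typing import Dict, Hashable, Union

import numpy as np
import pandas as pd


def class_counts(
    data: Union[pd.DataFrame, np.ndarray, None],
    target: Union[str, np.ndarray],
) -> Dict[Hashable, int]:
    """Return {class label: count} for a column name or a label array (hypothetical helper)."""
    if isinstance(target, str):
        # A column name only makes sense when `data` is a DataFrame.
        if not isinstance(data, pd.DataFrame):
            raise TypeError("a column name requires `data` to be a DataFrame")
        labels = data[target]
    else:
        # Assumption: any non-string target is an array-like of labels.
        labels = pd.Series(np.asarray(target))

    counts = labels.value_counts()
    # Drop empty classes (possible with categorical dtypes) so ratios stay finite.
    return {label: int(n) for label, n in counts.items() if n > 0}


def imbalance_ratio(counts: Dict[Hashable, int]) -> float:
    """Majority:minority ratio, e.g. 900 vs 100 gives 9.0."""
    return max(counts.values()) / min(counts.values())
```

Normalizing both input types to a single label Series up front would keep the rest of the function independent of how the caller passed the target.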
Example Return:
```python
{
    'is_imbalanced': True,
    'severity': 0.75,  # 0-1 scale
    'class_distribution': {'class_0': 900, 'class_1': 100},
    'imbalance_ratio': 9.0,  # majority:minority
    'recommendations': [
        'Use SMOTE for oversampling minority class',
        'Apply class weights in model training',
        'Consider threshold tuning'
    ]
}
```
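The issue does not yet pin down how the 0-1 severity score is computed. One simple candidate, shown here purely as an assumption, is one minus the minority:majority count ratio; `severity_score` is a hypothetical name.

```python
from collections import Counter
from typing import Hashable, Iterable


def severity_score(labels: Iterable[Hashable]) -> float:
    """Candidate severity: 0.0 when classes are equal, approaching 1.0 as the minority vanishes."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels provided")
    if len(counts) < 2:
        # Assumption: a single observed class is treated as maximally imbalanced.
        return 1.0
    return 1.0 - min(counts.values()) / max(counts.values())
```

For the 900/100 example this definition gives about 0.89 rather than the 0.75 shown in the example return, so the exact formula remains an open design choice.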
🔧 Implementation Details
File to modify: dskit/imbalance.py (existing file)
Function signature:
```python
from typing import Any, Dict, Union

import numpy as np
import pandas as pd


def detect_imbalance(
    data: Union[pd.DataFrame, np.ndarray],
    target: Union[str, np.ndarray],
    threshold: float = 0.3,
    visualize: bool = True,
    recommend: bool = True
) -> Dict[str, Any]:
    """
    Detect and analyze class imbalance in classification data.

    Parameters
    ----------
    data : DataFrame or array-like
        Dataset containing features and target
    target : str or array-like
        Target column name (if data is DataFrame) or target array
    threshold : float, default=0.3
        Imbalance threshold (e.g., 0.3 means a 30:70 split triggers a warning)
    visualize : bool, default=True
        Generate visualization of class distribution
    recommend : bool, default=True
        Provide handling strategy recommendations

    Returns
    -------
    dict
        Contains is_imbalanced, severity, class_distribution,
        imbalance_ratio, recommendations
    """
```
📊 Visualization
The function will generate a bar plot showing:
- Class counts with percentages
- Visual indication of imbalance severity
- Color coding (green=balanced, yellow=moderate, red=severe)
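A rough sketch of such a plot with matplotlib follows. The helper names `severity_color` and `plot_class_distribution` are hypothetical, and the 0.4 and 0.7 severity cutoffs are assumptions chosen only for illustration; the issue does not define where moderate and severe begin.

```python
from typing import Dict, Hashable

import matplotlib.pyplot as plt


def severity_color(severity: float, moderate: float = 0.4, severe: float = 0.7) -> str:
    """Map a 0-1 severity score to a bar color (cutoffs are assumed, not specified)."""
    if severity >= severe:
        return "red"
    if severity >= moderate:
        return "gold"  # a readable shade of yellow
    return "green"


def plot_class_distribution(counts: Dict[Hashable, int], severity: float):
    """Bar plot of class counts, annotated with each class's count and percentage."""
    total = sum(counts.values())
    labels = [str(label) for label in counts]
    values = list(counts.values())

    fig, ax = plt.subplots(figsize=(6, 4))
    bars = ax.bar(labels, values, color=severity_color(severity))

    # Write "count (percent)" just above each bar.
    for bar, value in zip(bars, values):
        ax.text(
            bar.get_x() + bar.get_width() / 2,
            bar.get_height(),
            f"{value} ({value / total:.1%})",
            ha="center",
            va="bottom",
        )

    ax.set_xlabel("Class")
    ax.set_ylabel("Count")
    ax.set_title(f"Class distribution (severity = {severity:.2f})")
    return ax
```

Returning the `Axes` instead of calling `plt.show()` leaves it to the caller to display, save, or further customize the figure.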
💪 Why This Matters
Current workflow:
```python
# Users do this manually
print(df['target'].value_counts())
# Then decide if they need SMOTE
```
After this feature:
```python
# One line detects and recommends
result = detect_imbalance(df, 'target')
if result['is_imbalanced']:
    X_bal, y_bal = apply_smote(X, y)  # existing function
```
Impact: Completes the imbalance handling workflow (detect → handle)
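To make the detect step more concrete, here is one way the `is_imbalanced` flag and the recommendations might be derived from the class counts. This is a sketch under stated assumptions: `flag_and_recommend` is a hypothetical helper, `threshold` is interpreted as the minimum acceptable minority-class share, and the `severe_share` cutoff of 0.2 is invented for illustration. The recommendation strings are taken from the example return.

```python
from typing import Dict, Hashable, List, Tuple


def flag_and_recommend(
    counts: Dict[Hashable, int],
    threshold: float = 0.3,
    severe_share: float = 0.2,  # assumed cutoff, not part of the proposal
) -> Tuple[bool, List[str]]:
    """Return (is_imbalanced, recommendations) for a {class: count} mapping."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one sample")

    # Assumption: data is imbalanced when the smallest class falls below `threshold`.
    minority_share = min(counts.values()) / total
    if minority_share >= threshold:
        return False, []

    recommendations = [
        "Apply class weights in model training",
        "Consider threshold tuning",
    ]
    if minority_share < severe_share:
        # Only suggest resampling once the minority class is scarce.
        recommendations.insert(0, "Use SMOTE for oversampling minority class")
    return True, recommendations
```

A share-based threshold like this suits binary targets; with many classes even a perfectly balanced dataset has small per-class shares, so a multiclass version would need the threshold scaled by the number of classes.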