Anomaly detection and fraud classification for financial transactions. Addresses class imbalance with SMOTE, Isolation Forest, and ensemble methods.
Fraud detection is difficult because fraudulent transactions are rare (typically 0.1-1% of all transactions). A model that predicts "not fraud" for everything achieves at least 99% accuracy at those rates but catches zero fraud, so accuracy alone says little about fraud-detection quality.
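The accuracy trap can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical batch of 10,000 transactions with a 0.5% fraud rate (both numbers are illustrative, not taken from any dataset) and scores a "model" that always predicts not-fraud.

```python
# Hypothetical batch: 10,000 transactions, 0.5% of them fraudulent.
n_total = 10_000
n_fraud = 50

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # always predict "not fraud"

# Count correct predictions overall and correctly flagged fraud cases.
correct = sum(t == p for t, p in zip(y_true, y_pred))
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / n_total
recall = caught / n_fraud

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.995
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

High accuracy with zero recall is why the metrics used for evaluation here focus on the fraud class.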
When you have labeled fraud/not-fraud data:
- Logistic Regression: Interpretable baseline
- Random Forest: Handles non-linear relationships
- XGBoost/LightGBM: State-of-the-art for tabular data
- Neural Networks: For complex patterns with sufficient data
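As a rough illustration of the interpretable baseline in the list above, the sketch below fits a logistic regression on synthetic data. The dataset (`make_classification` with about 1% positives), the 70/30 split, and the use of `class_weight="balanced"` are assumptions made for the example, not choices documented by this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled transactions: roughly 1% positives (fraud).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.99, 0.01], random_state=0
)

# Stratify so both splits keep approximately the same fraud rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" weights each class inversely to its frequency,
# so errors on the rare fraud class count more during training.
baseline = LogisticRegression(class_weight="balanced", max_iter=1000)
baseline.fit(X_train, y_train)

# Probability of the fraud class for each held-out transaction.
fraud_proba = baseline.predict_proba(X_test)[:, 1]
print("recall on held-out fraud:", recall_score(y_test, baseline.predict(X_test)))
```

The fitted coefficients (`baseline.coef_`) give a per-feature view of what drives a fraud score, which is the main reason to start with this model before moving to tree ensembles.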
When labels are unavailable:
- Isolation Forest: Isolates anomalies by random partitioning
- One-Class SVM: Learns boundary around normal data
- Autoencoders: High reconstruction error indicates anomaly
# SMOTE Oversampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
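# Class weights in XGBoost
from xgboost import XGBClassifier
# scale_pos_weight is usually set near (negative count / positive count);
# 100 corresponds to roughly 1% fraud
model = XGBClassifier(scale_pos_weight=100)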
# Undersampling majority class
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)

| Metric | Formula | What It Measures |
|---|---|---|
| Precision | TP / (TP + FP) | Of predicted fraud, how many were actually fraud? |
| Recall | TP / (TP + FN) | Of actual fraud, how many did we catch? |
| F1 Score | 2 × (P × R) / (P + R) | Balance of precision and recall |
| PR-AUC | Area under the precision-recall curve | Precision-recall trade-off across all decision thresholds |
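The metrics in the table can be computed with scikit-learn. The sketch below uses ten hypothetical transactions with made-up scores and an assumed decision threshold of 0.5; `average_precision_score` is used here as a common estimate of PR-AUC.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels (1 = fraud) and model scores for 10 transactions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.15, 0.80, 0.90])

# Assumed decision threshold; this gives TP = 2, FP = 1, FN = 1.
y_pred = (y_score >= 0.5).astype(int)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)                # 2PR / (P + R)  = 2/3

# Threshold-free summary computed from the raw scores, not from y_pred.
pr_auc = average_precision_score(y_true, y_score)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} pr_auc={pr_auc:.3f}")
```

Precision, recall, and F1 depend on the chosen threshold, while PR-AUC is computed from the scores themselves, so it is the better choice for comparing models before a threshold is fixed.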
- Transaction amount (unusual values)
- Time of transaction (unusual hours)
- Location mismatch with history
- Merchant category risk
- Transaction velocity (count in last hour/day)
- Amount deviation from typical spending
- New merchant or location
- Failed attempt history
- Device fingerprinting
- IP reputation
- Connection to known fraud rings
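Two of the features listed above, transaction velocity and amount deviation from typical spending, can be derived with pandas. This is a minimal sketch on a toy table; the column names (`card_id`, `timestamp`, `amount`) and the derived feature names are hypothetical, and "typical spending" is approximated here by the mean of the card's earlier transactions.

```python
import pandas as pd

# Toy transaction log; column names are assumptions for this example.
tx = pd.DataFrame({
    "card_id": ["A", "A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:20", "2024-01-01 10:50",
        "2024-01-01 13:00", "2024-01-01 14:00", "2024-01-01 14:30",
    ]),
    "amount": [20.0, 25.0, 400.0, 22.0, 100.0, 110.0],
})
tx = tx.sort_values(["card_id", "timestamp"]).reset_index(drop=True)

# Transaction velocity: number of transactions on the same card in the
# trailing hour, including the current one.
velocity = (
    tx.set_index("timestamp")
      .groupby("card_id")["amount"]
      .rolling("1h")
      .count()
)
# Row order matches tx because tx is already sorted by card and time.
tx["tx_last_hour"] = velocity.to_numpy().astype(int)

# Amount deviation: current amount relative to the mean of the card's
# previous transactions (shifted so the current row is excluded).
prior_mean = tx.groupby("card_id")["amount"].transform(
    lambda s: s.shift().expanding().mean()
)
tx["amount_vs_prior_mean"] = tx["amount"] / prior_mean

print(tx)
```

Using only earlier transactions for the running mean keeps information from the current or later transactions out of the feature. The first transaction on each card has no history, so its ratio is `NaN` and needs a fill strategy before modeling.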
from sklearn.ensemble import IsolationForest
# Isolation Forest for anomaly detection
iso_forest = IsolationForest(
    contamination=0.01,  # expect 1% anomalies
    random_state=42,
    n_estimators=100,
)
# Fit and predict (-1 for anomaly, 1 for normal)
predictions = iso_forest.fit_predict(X)
anomalies = X[predictions == -1]

- Python
- Scikit-learn
- XGBoost
- imbalanced-learn (SMOTE)
- Pandas / NumPy
- False positives: Blocking legitimate transactions frustrates customers
- Bias: Models may discriminate based on biased historical data
- Explainability: Regulations may require explaining flagged transactions
- Privacy: Balance security with user privacy in behavioral tracking
Uses the Credit Card Fraud Dataset from Kaggle: anonymized European cardholder transactions with 492 frauds out of 284,807 transactions (about 0.17%).
- imbalanced-learn Documentation
- SMOTE: Chawla et al. (2002)
- Isolation Forest: Liu et al. (2008)
MIT