This project explores the effectiveness of different machine learning classifiers (Support Vector Machines - SVM and CatBoost) combined with various dimensionality reduction and feature selection techniques for detecting fraudulent transactions. It leverages the IEEE-CIS Fraud Detection dataset from Kaggle. A key focus is comparing classical dimensionality reduction methods (PCA, ICA, Truncated SVD) against a nature-inspired approach (Genetic Algorithm - GA) for feature selection, and further optimizing classifier hyperparameters using Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO).
The goal of this project is to build an effective fraud detection model by comparing the performance of SVM and CatBoost classifiers under different feature engineering scenarios. We investigate:
- The impact of classical dimensionality reduction techniques (PCA, ICA, SVD).
- The effectiveness of a Genetic Algorithm (GA) for selecting an optimal feature subset.
- The ability of Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) to fine-tune classifier hyperparameters for improved performance, particularly when combined with GA-selected features.
The project utilizes the IEEE-CIS Fraud Detection dataset.
- Source: Kaggle Competition.
- Size: Contains 590,540 transaction records.
- Features: Over 400 features, split into transaction details (
train_transaction.csv,test_transaction.csv) and identity information (train_identity.csv,test_identity.csv). Features include transaction amount, time, product codes, device types, email domains, etc. - Target Variable:
isFraud(binary: 0 for non-fraudulent, 1 for fraudulent). - Challenge: Highly imbalanced dataset (approx. 3.5% fraudulent transactions) and high dimensionality with many missing values.
- Merging: Transaction and identity datasets were merged based on
TransactionID. - Missing Value Imputation:
- Numeric Features: Imputed with a constant value of 0 (acting as an indicator for missingness where 0 wasn't a natural value).
- Categorical Features: Imputed with the string 'missing'.
- Columns with >90% missing values were initially kept for GA to potentially evaluate.
- Feature Engineering:
id_33(screen resolution) was split intoScreen_WidthandScreen_Height.- Numeric columns with very few unique values (<20) were identified and treated as categorical.
- Encoding:
- Categorical Features: Weight of Evidence (WOE) encoding was used to handle high cardinality and prepare data for SVM.
- Scaling:
- StandardScaler was applied to all features after encoding and imputation.
- Train/Validation Split: Data was split into 80% training and 20% validation sets, stratified by the
isFraudtarget variable. - Class Imbalance Handling:
class_weight='balanced'was used for both SVM and CatBoost during training to address the severe class imbalance.
A fixed target of 30 features/dimensions was set for comparison.
- Principal Component Analysis (PCA): Standard PCA applied to reduce dimensions to 30 components.
- Independent Component Analysis (ICA): FastICA used to find 30 independent components.
- Truncated Singular Value Decomposition (SVD): Applied to reduce dimensions to 30 components.
- Genetic Algorithm (GA):
- Representation: Each individual in the population represented a subset of 30 feature indices.
- Fitness Function: Evaluated individuals based on the ROC AUC score achieved by a classifier (LinearSVC or CatBoost) trained on the corresponding feature subset using a train/validation split.
- Operators: Used standard crossover (union of parent features followed by random sampling to maintain size) and mutation (randomly replacing a feature index).
- Selection: Truncation selection (top 50% survive).
- Goal: Evolve a population of feature subsets to find the one maximizing the classifier's ROC AUC. Separate GA runs were performed for SVM and CatBoost.
- Support Vector Machine (SVM):
- Due to the dataset size,
LinearSVCwas used instead of standardSVC. CalibratedClassifierCVwas wrapped aroundLinearSVCto enable probability predictions (predict_proba) required for ROC AUC calculation.
- Due to the dataset size,
- CatBoost Classifier:
- A gradient boosting library known for its handling of categorical features and good performance. Used with
Loglossand evaluated usingAUC.
- A gradient boosting library known for its handling of categorical features and good performance. Used with
PSO and ACO were applied after feature selection with GA to tune the hyperparameters of the respective classifiers (SVM and CatBoost) using only the GA-selected features.
- Particle Swarm Optimization (PSO):
- Used the
pyswarmlibrary. - Defined bounds for relevant hyperparameters (e.g.,
C,max_iterfor SVM;learning_rate,depth,l2_leaf_reg, etc., for CatBoost). - The objective function trained the classifier with a given set of parameters and returned the validation ROC AUC score.
- PSO iteratively adjusted particle positions (hyperparameter sets) to maximize the objective function.
- Handled parameter type conversions (float to int) and dependencies (e.g.,
max_leavesonly forLossguidepolicy in CatBoost).
- Used the
- Ant Colony Optimization (ACO):
- A custom
ACO_HyperparameterOptimizerclass was implemented. - Parameter Grid: Defined discrete values for each hyperparameter to explore.
- Pheromone Trails: Maintained pheromone levels for each parameter-value pair.
- Solution Construction: Ants probabilistically selected parameter values based on pheromone levels (
alphainfluence). - Objective Function: Similar to PSO, evaluated a constructed hyperparameter set by training the classifier and returning the validation ROC AUC.
- Pheromone Update: Trails were updated based on evaporation (
rho) and deposition (Qscaled by solution score).
- A custom
- Languages: Python 3
- Core Libraries: pandas, numpy, scikit-learn, CatBoost, category_encoders, matplotlib, seaborn, pyswarm, json.
- Environment: Jupyter Notebook (
main.ipynb)
Performance was primarily evaluated using the ROC AUC score on the validation set and Kaggle public/private leaderboards.
(Validation ROC AUC / Kaggle Public Score / Kaggle Private Score)
- PCA + SVM: 0.8126 / 0.837181 / 0.798548
- ICA + SVM: 0.8126 / 0.837181 / 0.798548
- SVD + SVM: 0.8127 / 0.837300 / 0.798838
- GA + SVM: 0.8288 / 0.849871 / 0.808483
- GA + PSO + SVM: 0.8287 / 0.849987 / 0.808601
- GA + ACO + SVM: 0.8033 (Validation) / 0.826630 / 0.780363 (Note: Validation score dropped after final ACO model training)
SVM Insights:
- GA provided a noticeable improvement over classical dimensionality reduction methods for SVM.
- PSO tuning on GA features yielded a marginal improvement on Kaggle scores but not validation AUC compared to just GA.
- ACO tuning significantly degraded SVM performance in the final model evaluation, despite finding a good score during the optimization phase itself, suggesting potential overfitting during the ACO search or instability.
(Validation ROC AUC / Kaggle Public Score / Kaggle Private Score)
- PCA + CatBoost: 0.8802 / 0.363070 / 0.364691 (Poor Kaggle score suggests issues)
- ICA + CatBoost: 0.8041 / 0.528929 / 0.540807 (Poor Kaggle score suggests issues)
- SVD + CatBoost: 0.7955 / 0.363238 / 0.364369 (Poor Kaggle score suggests issues)
- GA + CatBoost (Baseline): 0.9259 / 0.906293 / 0.849037
- GA + PSO + CatBoost: 0.7880 / 0.906293 / 0.849037 (Lower validation AUC, same Kaggle score)
- GA + ACO + CatBoost: 0.9332 / 0.906293 / 0.849037 (Highest validation AUC, same Kaggle score)
CatBoost Insights:
- CatBoost significantly outperformed SVM across most configurations, especially when combined with GA.
- Classical DR methods (PCA, ICA, SVD) performed poorly with CatBoost based on Kaggle scores, indicating potential issues with how these transformations interacted with the CatBoost model, possibly due to bad interpretability caused by transformations or overall complexity of the final model.
- GA feature selection worked very well with CatBoost, yielding the best baseline performance.
- ACO tuning on GA features provided the highest validation ROC AUC for CatBoost.
- PSO tuning on GA features resulted in a lower validation AUC compared to the baseline GA+CatBoost.
- Interestingly, despite different validation scores, the Kaggle scores for GA+CatBoost, GA+PSO+CatBoost, and GA+ACO+CatBoost were identical, suggesting the leaderboard test set might not have been sensitive to the hyperparameter changes found by PSO/ACO or that the baseline GA model was already near optimal for that specific test set.
- Classifier: CatBoost generally outperformed LinearSVC (calibrated) on this dataset, especially after feature selection/reduction.
- Feature Selection: GA proved more effective than PCA, ICA, or SVD, particularly for CatBoost, leading to significantly better results. For SVM, GA also showed a slight edge.
- Hyperparameter Optimization:
- ACO showed promise, achieving the best validation score for CatBoost, although it degraded SVM performance.
- PSO provided marginal or no improvement over the GA baseline for both classifiers in terms of validation AUC, and sometimes resulted in worse performance.
- The identical Kaggle scores for the optimized CatBoost models suggest the gains seen in validation might not always translate directly to unseen data or specific leaderboard splits.
Analysis of feature importance for the GA-selected features using the optimized CatBoost models (PSO and ACO) revealed the top contributing features. While the exact ranking differed slightly between the PSO and ACO models, common important features included C1, C11, C14, card2, addr1, V283, V70, D1, V54, and C2. This indicates that these specific transaction counts, card details, address information, and Vesta-engineered features were highly relevant for fraud detection within the subset identified by GA.
This project demonstrated that combining CatBoost with a Genetic Algorithm for feature selection yielded the most effective fraud detection model on the IEEE-CIS dataset among the tested configurations. While classical dimensionality reduction methods were less successful, especially with CatBoost, GA effectively identified a high-performing 30-feature subset.
Nature-inspired hyperparameter optimization (PSO and ACO) showed mixed results. ACO successfully tuned CatBoost to achieve the highest validation AUC, while PSO struggled to improve upon the baseline GA models. However, these validation improvements didn't translate to better Kaggle scores in this instance.
The results highlight the potential of GAs for feature selection in high-dimensional, imbalanced datasets and suggest that while PSO/ACO can find good hyperparameter sets, careful validation and consideration of potential overfitting during the search process are crucial. CatBoost proved to be a robust classifier for this task.