This project builds a machine learning classifier to detect fraudulent credit card transactions using the IEEE-CIS dataset. The goal is to identify fraudulent transactions while minimizing false negatives, which are the most costly errors in this domain. The dataset includes:
- Transaction information (amount, time, product code, card info)
- Email domains and billing/shipping addresses
- Engineered and preprocessed features from the raw dataset
- Dropped columns with high missing values
- Imputed missing values with median (numerical) or mode (categorical)
- Encoded categorical variables using one-hot (low-cardinality) or ordinal encoding (high-cardinality)
- Scaled features using `StandardScaler`
- Reduced dimensionality using PCA to retain 90% of variance (70 components)
- Logistic Regression (`class_weight='balanced'`)
- K-Nearest Neighbors
- Support Vector Machine (`LinearSVC`)
- Random Forest (`class_weight='balanced'`)
- XGBoost (`scale_pos_weight` adjusted for imbalance)
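A rough sketch of how the class-imbalance settings above are wired up, using synthetic data in place of the preprocessed features (the model list and hyperparameters here are illustrative, not the tuned project configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic imbalanced data: ~95% negatives, standing in for real transactions
X, y = make_classification(n_samples=500, weights=[0.95], random_state=0)

# class_weight='balanced' reweights each class inversely to its frequency
models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "svm": LinearSVC(class_weight="balanced"),
    "rf": RandomForestClassifier(class_weight="balanced", random_state=0),
}
for name, model in models.items():
    model.fit(X, y)

# XGBoost's equivalent knob: scale_pos_weight, typically n_negative / n_positive
neg, pos = (y == 0).sum(), (y == 1).sum()
print("scale_pos_weight ~", neg / pos)
```

For XGBoost itself, the computed ratio would be passed as `XGBClassifier(scale_pos_weight=neg / pos)`.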
- Precision: Fraction of predicted frauds that were actually fraud
- Recall: Fraction of actual frauds that were correctly predicted (most important in this use case)
- F1 Score: Harmonic mean of precision and recall
- ROC AUC: Area under the receiver operating characteristic curve
- Confusion Matrix: True/false positives and negatives
- Training time per model
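The metrics above map directly onto scikit-learn's `sklearn.metrics` functions. A minimal sketch with hypothetical labels and scores (in the project these come from each model's test-set predictions):

```python
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and probability scores
y_true  = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 0]
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.2]

print("precision:", precision_score(y_true, y_pred))   # 2 of 3 predicted frauds are real
print("recall:   ", recall_score(y_true, y_pred))      # 2 of 3 actual frauds caught
print("F1:       ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))    # needs scores, not hard labels
print(confusion_matrix(y_true, y_pred))                # [[TN, FP], [FN, TP]]
```

Note that ROC AUC is computed from continuous scores, while precision, recall, F1, and the confusion matrix depend on the chosen decision threshold.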
- K-Nearest Neighbors achieved the best balance of F1 and recall on the test set.
- ROC curves and precision-recall trade-offs were visualized to assess model performance across thresholds.
- Optimal thresholds for maximum F1 score were determined for each model.
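Threshold selection for maximum F1 can be sketched with `precision_recall_curve`: evaluate F1 at every candidate threshold and keep the best one. The scores below are hypothetical; the project repeats this per model on test-set scores.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical test-set labels and model scores
y_true  = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.3])

prec, rec, thresh = precision_recall_curve(y_true, y_score)
# F1 at each candidate threshold; the final precision/recall point has no threshold,
# so drop it, and add a tiny epsilon to avoid division by zero
f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
best = thresh[np.argmax(f1)]
print(f"best threshold = {best:.2f}, F1 = {f1.max():.3f}")
```

Lowering the threshold below the default 0.5 is a common way to trade precision for the higher recall this use case prioritizes.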
This repository also includes a record of the other homework assignments completed in class, covering a wide range of machine learning topics:
- May 7 – Chapter 1: Giving computers the ability to learn from data; types of machine learning; Python basics for ML.
- May 12 & 14 – Chapter 2: Training simple algorithms for classification; perceptrons and adaptive linear neurons.
- May 19–28 – Chapter 3: Touring classifiers in scikit-learn; logistic regression, SVMs, decision trees, KNN.
- June 2–4 – Chapter 4: Data preprocessing; handling missing and categorical data; feature scaling and importance.
- June 9–11 – Chapter 5: Dimensionality reduction; PCA, LDA, nonlinear methods.
- June 18–23 – Chapter 6: Model evaluation; k-fold cross-validation, pipelines, learning curves, hyperparameter tuning.
- June 25–30 – Chapter 7: Ensemble learning; bagging, boosting, adaptive and gradient boosting.
- July 2–7 – Chapter 9: Regression analysis; linear, robust, polynomial, and random forest regression.
- July 9–14 – Chapter 10: Clustering; k-means, hierarchical, DBSCAN.
- July 16–21 – Chapter 11: Multilayer artificial neural networks; training and convergence.
- July 23–30 – Chapter 12: Parallelizing neural network training with PyTorch; building input pipelines, NN model design, activation functions.