jLevar/ml-uvu-4700

Credit Card Fraud Detection

Overview

This project builds a machine learning classifier to detect fraudulent credit card transactions using the IEEE-CIS dataset. The goal is to identify fraudulent transactions while minimizing false negatives, which are the most costly errors in this domain.

Features

  • Transaction information (amount, time, product code, card info)
  • Email domains and billing/shipping addresses
  • Engineered and preprocessed features from the raw dataset

Preprocessing

  • Dropped columns with high missing values
  • Imputed missing values with median (numerical) or mode (categorical)
  • Encoded categorical variables using one-hot (low-cardinality) or ordinal encoding (high-cardinality)
  • Scaled features using StandardScaler
  • Reduced dimensionality using PCA to retain 90% of variance (70 components)

Models

  • Logistic Regression (class_weight='balanced')
  • K-Nearest Neighbors
  • Support Vector Machine (LinearSVC)
  • Random Forest (class_weight='balanced')
  • XGBoost (scale_pos_weight adjusted for imbalance)
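A minimal sketch of how these imbalance-aware configurations might look; the class counts here are illustrative, and any hyperparameters beyond the ones named above are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Illustrative class counts; scale_pos_weight is commonly set to
# (negative count / positive count) for imbalanced data.
n_neg, n_pos = 96_000, 4_000

models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "linear_svc": LinearSVC(class_weight="balanced"),
    "random_forest": RandomForestClassifier(class_weight="balanced"),
}

# xgboost is a separate package; skip it gracefully if unavailable.
try:
    from xgboost import XGBClassifier
    models["xgboost"] = XGBClassifier(scale_pos_weight=n_neg / n_pos)
except ImportError:
    pass
```

`class_weight='balanced'` reweights each class inversely to its frequency, so the rare fraud class contributes as much to the loss as the majority class.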

Evaluation Metrics

  • Precision: Fraction of predicted frauds that were actually fraud
  • Recall: Fraction of actual frauds that were correctly predicted (most important in this use case)
  • F1 Score: Harmonic mean of precision and recall
  • ROC AUC: Area under the receiver operating characteristic curve
  • Confusion Matrix: True/false positives and negatives
  • Training time per model
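These metrics map directly onto scikit-learn's `sklearn.metrics` functions. A sketch for one model, using synthetic imbalanced data in place of the real held-out test split:

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix)
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset (~5% positive class) standing in for the real data.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
start = time.perf_counter()
clf.fit(X_tr, y_tr)
train_time = time.perf_counter() - start   # training time per model

y_pred = clf.predict(X_te)
y_score = clf.predict_proba(X_te)[:, 1]    # ROC AUC needs scores, not labels

metrics = {
    "precision": precision_score(y_te, y_pred),
    "recall": recall_score(y_te, y_pred),  # the key metric in this use case
    "f1": f1_score(y_te, y_pred),
    "roc_auc": roc_auc_score(y_te, y_score),
    "confusion": confusion_matrix(y_te, y_pred),
    "train_time_s": train_time,
}
```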

Results

  • K-Nearest Neighbors achieved the best balance of F1 and recall on the test set.
  • ROC curves and precision-recall trade-offs were visualized to assess model performance across thresholds.
  • Optimal thresholds for maximum F1 score were determined for each model.
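One common way to find the F1-maximizing threshold mentioned above is to sweep the precision-recall curve; this helper is a sketch of that idea, not the project's exact code:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) maximizing F1 over the precision-recall curve."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more entry than thresholds; drop the final
    # (precision=1, recall=0) point, and guard against division by zero.
    f1 = (2 * precision[:-1] * recall[:-1]
          / np.maximum(precision[:-1] + recall[:-1], 1e-12))
    i = int(np.argmax(f1))
    return thresholds[i], f1[i]
```

Applied per model, this replaces the default 0.5 decision threshold with one tuned to the precision/recall trade-off, which matters when classes are heavily imbalanced.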

Other Homework in This Repo

This repository also includes a record of other homework assignments completed for the class, covering a wide range of machine learning topics:

  • May 7 – Chapter 1: Giving computers the ability to learn from data; types of machine learning; Python basics for ML.
  • May 12 & 14 – Chapter 2: Training simple algorithms for classification; perceptrons and adaptive linear neurons.
  • May 19–28 – Chapter 3: Touring classifiers in scikit-learn; logistic regression, SVMs, decision trees, KNN.
  • June 2–4 – Chapter 4: Data preprocessing; handling missing and categorical data; feature scaling and importance.
  • June 9–11 – Chapter 5: Dimensionality reduction; PCA, LDA, nonlinear methods.
  • June 18–23 – Chapter 6: Model evaluation; k-fold cross-validation, pipelines, learning curves, hyperparameter tuning.
  • June 25–30 – Chapter 7: Ensemble learning; bagging, boosting, adaptive and gradient boosting.
  • July 2–7 – Chapter 9: Regression analysis; linear, robust, polynomial, and random forest regression.
  • July 9–14 – Chapter 10: Clustering; k-means, hierarchical, DBSCAN.
  • July 16–21 – Chapter 11: Multilayer artificial neural networks; training and convergence.
  • July 23–30 – Chapter 12: Parallelizing neural network training with PyTorch; building input pipelines, NN model design, activation functions.
