This project builds a machine learning classifier to detect fraudulent credit card transactions using the IEEE-CIS dataset. The goal is to identify fraudulent transactions while minimizing false negatives, which are the most costly errors in this domain. The dataset includes:
- Transaction information (amount, time, product code, card info)
- Email domains and billing/shipping addresses
- Engineered and preprocessed features from the raw dataset
- Dropped columns with high missing values
- Imputed missing values with median (numerical) or mode (categorical)
- Encoded categorical variables using one-hot (low-cardinality) or ordinal encoding (high-cardinality)
- Scaled features using `StandardScaler`
- Reduced dimensionality using PCA to retain 90% of variance (70 components)
- Logistic Regression (`class_weight='balanced'`)
- K-Nearest Neighbors
- Support Vector Machine (`LinearSVC`)
- Random Forest (`class_weight='balanced'`)
- XGBoost (`scale_pos_weight` adjusted for imbalance)
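A rough sketch of how the class-imbalance settings above are wired up, using synthetic data in place of the preprocessed features (the model list and hyperparameters here are illustrative, not the tuned project configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic imbalanced data: ~95% negatives, standing in for real transactions
X, y = make_classification(n_samples=500, weights=[0.95], random_state=0)

# class_weight='balanced' reweights each class inversely to its frequency
models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "svm": LinearSVC(class_weight="balanced"),
    "rf": RandomForestClassifier(class_weight="balanced", random_state=0),
}
for name, model in models.items():
    model.fit(X, y)

# XGBoost's equivalent knob: scale_pos_weight, typically n_negative / n_positive
neg, pos = (y == 0).sum(), (y == 1).sum()
print("scale_pos_weight ~", neg / pos)
```

For XGBoost itself, the computed ratio would be passed as `XGBClassifier(scale_pos_weight=neg / pos)`.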
- Precision: Fraction of predicted frauds that were actually fraud
- Recall: Fraction of actual frauds that were correctly predicted (most important in this use case)
- F1 Score: Harmonic mean of precision and recall
- ROC AUC: Area under the receiver operating characteristic curve
- Confusion Matrix: True/false positives and negatives
- Training time per model
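The metrics above map directly onto scikit-learn's `sklearn.metrics` functions. A minimal sketch with hypothetical labels and scores (in the project these come from each model's test-set predictions):

```python
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and probability scores
y_true  = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 0]
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.2]

print("precision:", precision_score(y_true, y_pred))   # 2 of 3 predicted frauds are real
print("recall:   ", recall_score(y_true, y_pred))      # 2 of 3 actual frauds caught
print("F1:       ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))    # needs scores, not hard labels
print(confusion_matrix(y_true, y_pred))                # [[TN, FP], [FN, TP]]
```

Note that ROC AUC is computed from continuous scores, while precision, recall, F1, and the confusion matrix depend on the chosen decision threshold.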
- K-Nearest Neighbors achieved the best balance of F1 and recall on the test set.
- ROC curves and precision-recall trade-offs were visualized to assess model performance across thresholds.
- Optimal thresholds for maximum F1 score were determined for each model.
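Threshold selection for maximum F1 can be sketched with `precision_recall_curve`: evaluate F1 at every candidate threshold and keep the best one. The scores below are hypothetical; the project repeats this per model on test-set scores.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical test-set labels and model scores
y_true  = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.3])

prec, rec, thresh = precision_recall_curve(y_true, y_score)
# F1 at each candidate threshold; the final precision/recall point has no threshold,
# so drop it, and add a tiny epsilon to avoid division by zero
f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
best = thresh[np.argmax(f1)]
print(f"best threshold = {best:.2f}, F1 = {f1.max():.3f}")
```

Lowering the threshold below the default 0.5 is a common way to trade precision for the higher recall this use case prioritizes.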
This repository also includes a record of the other homework assignments completed in class, covering a wide range of machine learning topics:
- May 7 – Chapter 1: Giving computers the ability to learn from data; types of machine learning; Python basics for ML.
- May 12 & 14 – Chapter 2: Training simple algorithms for classification; perceptrons and adaptive linear neurons.
- May 19–28 – Chapter 3: Touring classifiers in scikit-learn; logistic regression, SVMs, decision trees, KNN.
- June 2–4 – Chapter 4: Data preprocessing; handling missing and categorical data; feature scaling and importance.
- June 9–11 – Chapter 5: Dimensionality reduction; PCA, LDA, nonlinear methods.
- June 18–23 – Chapter 6: Model evaluation; k-fold cross-validation, pipelines, learning curves, hyperparameter tuning.
- June 25–30 – Chapter 7: Ensemble learning; bagging, boosting, adaptive and gradient boosting.
- July 2–7 – Chapter 9: Regression analysis; linear, robust, polynomial, and random forest regression.
- July 9–14 – Chapter 10: Clustering; k-means, hierarchical, DBSCAN.
- July 16–21 – Chapter 11: Multilayer artificial neural networks; training and convergence.
- July 23–30 – Chapter 12: Parallelizing neural network training with PyTorch; building input pipelines, NN model design, activation functions.