AI-Driven ETL Anomaly Detection

Python FastAPI Scikit-Learn

Project Overview

This project demonstrates an AI-powered ETL pipeline that automatically detects anomalies and data quality issues in structured datasets. It integrates data ingestion, preprocessing, machine-learning-based anomaly detection, and a FastAPI service that serves anomaly scores in real time.

The repository is designed to showcase skills in:

  • Python-based ETL pipelines
  • Machine Learning (anomaly detection using Scikit-Learn)
  • Backend deployment with FastAPI
  • Data quality monitoring and reporting
  • Clean, professional project structure for enterprise-level applications

Features

  • Ingest data from CSV files or databases
  • Perform data cleaning and preprocessing
  • Feature engineering for anomaly detection
  • Train and evaluate an ML model to detect anomalies
  • Generate reports highlighting data quality issues
  • Expose a FastAPI /predict endpoint for real-time anomaly scoring

Data

Synthetic Transactions Dataset

  • Path: data/raw/synthetic_transactions.csv
  • Includes:
    • Normal transactions
    • Injected anomalies: large amounts, negative/zero values, category deviations
  • Fully included in the repo for exploration and modeling (see the loading sketch below)
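
For a quick first look, here is a minimal sketch of loading the synthetic dataset and sanity-checking the injected anomalies. The amount column name is an assumption for illustration; check the CSV header for the actual schema.

import pandas as pd

# Load the bundled synthetic dataset from the path above.
df = pd.read_csv('data/raw/synthetic_transactions.csv')

# Quick sanity checks: shape and dtypes.
print(df.shape)
print(df.dtypes)

# 'amount' is an assumed column name; adjust to the actual header.
# Injected anomalies include negative/zero amounts and large values.
if 'amount' in df.columns:
    print(df['amount'].describe())
    print('non-positive amounts:', (df['amount'] <= 0).sum())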

Kaggle Credit Card Fraud Dataset (Not Included)

  • Original dataset: Credit Card Fraud Detection
  • Note: creditcard.csv is too large for GitHub, so it is not included in this repo.
  • To use the Kaggle dataset locally:
    1. Sign in to Kaggle and download creditcard.csv.
    2. Place it at data/raw/creditcard.csv
    3. The 03-ml-training.ipynb notebook will load it automatically from this path.
# Example: loading Kaggle data
import pandas as pd

df_kaggle = pd.read_csv('data/raw/creditcard.csv')

Installation

This project uses Pipenv for dependency management:

pipenv install --dev
pipenv shell

Alternatively, if you prefer pip:

pip install -r requirements.txt

Usage

Notebooks Overview

This project includes a set of structured Jupyter notebooks that walk through the full lifecycle of the anomaly detection pipeline:

01-data-exploration.ipynb

Initial EDA, anomaly visualization, data distributions, missing values, and exploratory insights.
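
As a hedged illustration of the kind of EDA this notebook covers (not its exact cells), a quick look at missing values and feature distributions:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data/raw/synthetic_transactions.csv')

# Missing values per column.
print(df.isna().sum())

# Histograms of numeric features; injected anomalies tend to show up
# as long tails or isolated bars far from the main mass.
df.hist(bins=50, figsize=(10, 6))
plt.tight_layout()
plt.show()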

02-preprocessing.ipynb

ETL pipeline construction, cleaning, scaling, handling skewed features, and feature engineering.
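
A minimal preprocessing sketch along these lines. The signed-log transform and standard scaling are illustrative choices, not necessarily the notebook's exact steps:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('data/raw/synthetic_transactions.csv')

# Numeric features only; categorical encoding (e.g., one-hot) would be
# added here in a fuller pipeline.
numeric = df.select_dtypes(include='number').fillna(0)

# Signed log transform tames heavy tails while preserving the sign of
# negative injected anomalies.
skew_handled = np.sign(numeric) * np.log1p(numeric.abs())

# Standardize so no single feature dominates the model.
X = StandardScaler().fit_transform(skew_handled)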

03-ml-training.ipynb

Model training (primarily Isolation Forest, with room for other estimators), hyperparameter tuning, evaluation including ROC/AUC, and result interpretation.
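
A hedged sketch of that workflow with Isolation Forest. The contamination value is a placeholder to tune per dataset, and the random matrix stands in for the scaled features from the preprocessing step:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))  # stand-in for the scaled feature matrix

# contamination is a rough prior on the anomaly rate; tune per dataset.
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
model.fit(X)

# decision_function: lower = more anomalous; predict: -1 = anomaly, 1 = normal.
scores = model.decision_function(X)
labels = model.predict(X)

# With ground-truth labels (y_true, 1 = anomaly), ROC/AUC can be computed
# on the negated scores so larger values mean "more anomalous":
# from sklearn.metrics import roc_auc_score
# auc = roc_auc_score(y_true, -scores)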

A detailed explanation of each notebook is provided in notebooks/README.md to help reviewers understand the design decisions and methodology.

FastAPI

Run the API locally:

python src/api.py

Example request:

POST /predict
Content-Type: application/json

[
    {
        "feature1": 10,
        "feature2": "A",
        "feature3": 3.14
    }
]

Example response:

[
    {
        "anomaly": true,
        "score": -0.65
    }
]
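
For orientation, here is a hypothetical sketch of what such an endpoint could look like. The model artifact path and feature handling are assumptions; see src/api.py for the actual implementation.

import joblib
import pandas as pd
from fastapi import FastAPI

app = FastAPI()

# Assumed artifact path; the training step defines the real one.
model = joblib.load('models/isolation_forest.joblib')

@app.post('/predict')
def predict(records: list[dict]):
    # In practice, the same preprocessing/encoding used at training time
    # (e.g., for categorical fields like "feature2") must be applied here.
    df = pd.DataFrame(records)
    scores = model.decision_function(df)
    preds = model.predict(df)  # -1 = anomaly, 1 = normal
    return [
        {'anomaly': bool(p == -1), 'score': float(s)}
        for p, s in zip(preds, scores)
    ]

Once running, the endpoint accepts a JSON list of records and returns one anomaly/score pair per record, matching the example above.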

Project Structure

ai-etl-anomaly-detection/
    data/
    notebooks/
    src/
        data_loader.py
        preprocessing.py
        feature_engineering.py
        model.py
        evaluate.py
        api.py
    models/
    tests/
    Pipfile
    Pipfile.lock
    requirements.txt (optional)
    README.md
    .gitignore

License

This project is licensed under the MIT License.


Next Steps / Enhancements

  • Add automated ETL orchestration with Airflow
  • Implement real-time anomaly monitoring dashboards
  • Include additional ML models (e.g., Autoencoders) for advanced anomaly detection
  • Deploy API to cloud services (AWS, GCP, Azure)

Contact / Author

D Fashimpaur
LinkedIn