This project demonstrates an AI-powered ETL pipeline that automatically detects anomalies and data quality issues in structured datasets. It integrates data ingestion, preprocessing, machine learning-based anomaly detection, and a FastAPI deployment to provide actionable insights.
The repository is designed to showcase skills in:
- Python-based ETL pipelines
- Machine Learning (anomaly detection using Scikit-Learn)
- Backend deployment with FastAPI
- Data quality monitoring and reporting
- Clean, professional project structure for enterprise-level applications
The pipeline provides the following capabilities:

- Ingest data from CSV files or databases
- Perform data cleaning and preprocessing
- Feature engineering for anomaly detection
- Train and evaluate an ML model to detect anomalies
- Generate reports highlighting data quality issues
- Expose a FastAPI `/predict` endpoint for real-time anomaly scoring
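As a rough end-to-end sketch of those steps with scikit-learn (the column handling and parameters here are illustrative assumptions, not the repo's actual modules):

```python
# Illustrative pipeline sketch; parameters and column handling are assumptions.
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Ingest
df = pd.read_csv("data/raw/synthetic_transactions.csv")

# Clean and preprocess: keep numeric features, drop rows with missing values
features = df.select_dtypes(include="number").dropna()
X = StandardScaler().fit_transform(features)

# Train an unsupervised anomaly detector
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(X)

# Report: predict() returns -1 for anomalies, 1 for normal rows
features["anomaly"] = model.predict(X)
print(features["anomaly"].value_counts())
```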
The project works with two datasets.

Synthetic dataset:
- Path: `data/raw/synthetic_transactions.csv`
- Includes:
  - Normal transactions
  - Injected anomalies: large amounts, negative/zero values, category deviations
- Fully included in the repo for exploration and modeling
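For context, a minimal sketch of how such anomalies could be injected into synthetic data (column names and magnitudes here are assumptions, not necessarily what the repo uses):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical baseline of normal transactions; column names are illustrative.
df = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=0.5, size=1000),
    "category": rng.choice(["food", "travel", "retail"], size=1000),
})

# Inject the three anomaly types described above
idx = rng.choice(df.index, size=30, replace=False)
df.loc[idx[:10], "amount"] = rng.uniform(10_000, 50_000, size=10)  # large amounts
df.loc[idx[10:20], "amount"] = rng.choice([0.0, -50.0], size=10)   # negative/zero values
df.loc[idx[20:], "category"] = "unknown"                           # category deviations
```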
Kaggle dataset:
- Original dataset: Credit Card Fraud Detection
- Note: `creditcard.csv` is too large for GitHub, so it is not included in this repo.
- To use the Kaggle dataset locally:
  1. Sign in to Kaggle and download `creditcard.csv`.
  2. Place it at `data/raw/creditcard.csv`.
  3. The Day 3 notebook will automatically load it from this path.
```python
# Example: loading Kaggle data
import pandas as pd

df_kaggle = pd.read_csv('data/raw/creditcard.csv')
```

This project uses Pipenv for dependency management:
```bash
pipenv install --dev
pipenv shell
```
Alternatively, if you prefer pip:
```bash
pip install -r requirements.txt
```
This project includes a set of structured Jupyter notebooks that walk through the full lifecycle of the anomaly detection pipeline:
1. Initial EDA, anomaly visualization, data distributions, missing values, and exploratory insights.
2. ETL pipeline construction, cleaning, scaling, handling skewed features, and feature engineering.
3. Model training (Isolation Forest or others), tuning, evaluation metrics, ROC/AUC, and result interpretation.
A detailed explanation for each notebook is provided in `notebooks/README.md` to help reviewers understand design decisions and methodology.
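To illustrate the model-training stage, a minimal sketch of fitting an Isolation Forest and computing ROC/AUC (the `is_anomaly` ground-truth column and the parameters are assumptions for illustration, not the notebook's exact setup):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

df = pd.read_csv("data/raw/synthetic_transactions.csv")

# Assumes a binary 'is_anomaly' ground-truth column exists; illustrative only.
y = df["is_anomaly"]
X = df.select_dtypes(include="number").drop(columns=["is_anomaly"])

model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
model.fit(X)

# score_samples is higher for normal points; negate so larger = more anomalous
scores = -model.score_samples(X)
print("ROC AUC:", roc_auc_score(y, scores))
```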
Run the API locally:
```bash
python src/api.py
```
Example request:
```http
POST /predict
Content-Type: application/json

[
  {
    "feature1": 10,
    "feature2": "A",
    "feature3": 3.14
  }
]
```
Example response:
```json
[
  {
    "anomaly": true,
    "score": -0.65
  }
]
```
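For orientation, a minimal sketch of what a `/predict` endpoint like this might look like in `src/api.py` (the model path, pickle format, and feature encoding are assumptions, not the repo's exact implementation):

```python
# Minimal sketch; model artifact path and feature encoding are assumptions.
import pickle
from typing import Any, Dict, List

import pandas as pd
import uvicorn
from fastapi import FastAPI

app = FastAPI()

with open("models/model.pkl", "rb") as f:  # hypothetical artifact path
    model = pickle.load(f)

@app.post("/predict")
def predict(records: List[Dict[str, Any]]):
    # Naive one-hot encoding for illustration; real preprocessing would need
    # to match the columns the model was trained on.
    X = pd.get_dummies(pd.DataFrame(records))
    scores = model.score_samples(X)   # assumes an IsolationForest-like model
    preds = model.predict(X)          # -1 = anomaly, 1 = normal
    return [
        {"anomaly": bool(p == -1), "score": float(s)}
        for p, s in zip(preds, scores)
    ]

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

With the server running, the example request above can be sent to `http://localhost:8000/predict` using curl or any HTTP client.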
```
ai-etl-anomaly-detection/
    data/
    notebooks/
    src/
        data_loader.py
        preprocessing.py
        feature_engineering.py
        model.py
        evaluate.py
        api.py
    models/
    tests/
    Pipfile
    Pipfile.lock
    requirements.txt   (optional)
    README.md
    .gitignore
```
This project is licensed under the MIT License.
- Add automated ETL orchestration with Airflow
- Implement real-time anomaly monitoring dashboards
- Include additional ML models (e.g., Autoencoders) for advanced anomaly detection
- Deploy API to cloud services (AWS, GCP, Azure)
D Fashimpaur
LinkedIn