A production-ready machine learning pipeline for SMS spam classification, built with MLOps best practices using DVC (Data Version Control), experiment tracking, and automated workflows.
## Table of Contents

- Overview
- Features
- Project Structure
- Installation
- Usage
- Pipeline Stages
- Experiment Tracking
- Model Performance
- Technologies Used
- Contributing
- License
## Overview

This project implements an end-to-end ML pipeline for SMS spam detection using Natural Language Processing (NLP) techniques. The pipeline is fully automated with DVC, allowing for reproducible experiments, version control of data and models, and easy parameter tuning.
The model achieves 97.07% accuracy and 98.33% precision on the test set, making it highly effective at identifying spam messages while minimizing false positives.
## Features

- Automated ML Pipeline: Complete DVC pipeline from data ingestion to model evaluation
- Experiment Tracking: Built-in experiment tracking with DVCLive
- Version Control: Data and model versioning using DVC
- Cloud Storage: S3 integration for storing pipeline artifacts
- Comprehensive Logging: Detailed logging at every pipeline stage
- Parameterized Pipeline: Easy hyperparameter tuning via `params.yaml`
- Modular Code: Clean, maintainable, and well-documented codebase
## Project Structure

```
├── .dvc/                      # DVC configuration
│   ├── config                 # DVC remote storage config (S3)
│   └── .gitignore
├── data/
│   ├── raw/                   # Raw train/test splits
│   ├── interm/                # Preprocessed data
│   └── processed/             # Feature-engineered data (TF-IDF)
├── models/
│   └── model.pkl              # Trained model artifact
├── reports/
│   └── metrics.json           # Model evaluation metrics
├── dvclive/                   # DVCLive experiment tracking
│   ├── metrics.json           # Latest metrics
│   ├── params.yaml            # Latest parameters
│   └── plots/                 # Metric plots over experiments
├── experiments/
│   ├── notebook.ipynb         # Jupyter notebook for exploration
│   └── spam.csv               # Data
├── src/
│   ├── data_ingestion.py      # Data loading and splitting
│   ├── data_preprocessing.py  # Text cleaning and encoding
│   ├── feature_engineering.py # TF-IDF vectorization
│   ├── model_building.py      # Model training
│   └── model_evaluation.py    # Model evaluation and tracking
├── logs/                      # Application logs
├── dvc.yaml                   # DVC pipeline definition
├── params.yaml                # Hyperparameters configuration
├── projectflow.txt            # Project workflow documentation
└── README.md                  # Project documentation
```
## Installation

### Prerequisites

- Python 3.9 or higher
- Git
- AWS Account (for S3 storage)
- pip package manager
1. Clone the repository

   ```bash
   git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
   cd YOUR_REPO_NAME
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Download NLTK data

   ```bash
   python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
   ```

5. Configure AWS credentials (for the DVC S3 remote)

   ```bash
   aws configure  # Enter your AWS Access Key ID, Secret Access Key, and region
   ```

6. Initialize DVC (if not already initialized)

   ```bash
   dvc init
   dvc remote add -d dvcstore s3://your-s3-bucket-name
   ```
## Usage

Execute the entire pipeline from data ingestion to model evaluation:

```bash
dvc repro
```

You can also run specific pipeline stages:

```bash
# Data ingestion only
python src/data_ingestion.py

# Data preprocessing only
python src/data_preprocessing.py

# Feature engineering only
python src/feature_engineering.py

# Model building only
python src/model_building.py

# Model evaluation only
python src/model_evaluation.py
```

Visualize the pipeline dependencies:

```bash
dvc dag
```

Run experiments with different parameters:

```bash
# Run a new experiment
dvc exp run

# View all experiments
dvc exp show

# Apply a specific experiment
dvc exp apply <experiment-name>

# Remove an experiment
dvc exp remove <experiment-name>
```

Edit `params.yaml` to tune hyperparameters:
```yaml
data_ingestion:
  test_size: 0.3
feature_engineering:
  max_features: 500
model_building:
  n_estimators: 25
  random_state: 2
```

Then rerun the pipeline:

```bash
dvc repro
```

## Pipeline Stages

### Data Ingestion (`src/data_ingestion.py`)

- Input: Raw SMS dataset from URL
- Process:
  - Load data from CSV
  - Drop unnecessary columns
  - Rename columns to `target` and `text`
  - Split into train/test sets (70/30 split by default)
- Output: `data/raw/train.csv`, `data/raw/test.csv`
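A minimal sketch of this stage, assuming pandas and scikit-learn; the toy DataFrame and the `v1`/`v2` column names stand in for the downloaded `spam.csv`, and the defaults mirror `params.yaml`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def ingest(df: pd.DataFrame, test_size: float = 0.3, random_state: int = 2):
    """Keep the label/text columns, rename them, and split train/test."""
    df = df.iloc[:, :2].copy()          # drop unnecessary columns
    df.columns = ["target", "text"]     # rename to target and text
    return train_test_split(df, test_size=test_size, random_state=random_state)

# Toy data standing in for the real spam.csv download
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam"],
    "v2": ["hi there", "win cash now", "see you soon", "free prize"],
})
train_df, test_df = ingest(raw)
```

In the real stage the resulting frames are written to `data/raw/train.csv` and `data/raw/test.csv`.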
### Data Preprocessing (`src/data_preprocessing.py`)

- Input: Raw train/test data
- Process:
  - Label encode the target column (ham=0, spam=1)
  - Remove duplicates
  - Text transformation:
    - Lowercase conversion
    - Tokenization
    - Remove non-alphanumeric characters
    - Remove stopwords and punctuation
    - Stemming (Porter Stemmer)
- Output: `data/interm/train_processed.csv`, `data/interm/test_processed.csv`
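The text transformation steps above can be sketched in plain Python. The tiny stopword set and `crude_stem` below are illustrative stand-ins for NLTK's stopword corpus and `PorterStemmer`, so the snippet runs without downloading NLTK data:

```python
import re
import string

# Stand-in for NLTK's English stopword corpus (illustrative only)
STOPWORDS = {"the", "a", "an", "is", "to", "you", "i", "and"}

def crude_stem(word: str) -> str:
    # Naive suffix stripping; the real stage uses nltk's PorterStemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def transform_text(text: str) -> str:
    tokens = text.lower().split()                            # lowercase + tokenize
    tokens = [re.sub(r"[^a-z0-9]", "", t) for t in tokens]   # keep alphanumerics
    tokens = [t for t in tokens
              if t and t not in STOPWORDS
              and t not in string.punctuation]               # drop stopwords/punctuation
    return " ".join(crude_stem(t) for t in tokens)           # stem each token

print(transform_text("WINNING a FREE prize, claim now!!"))
```

Label encoding in the real stage is a simple mapping such as `{"ham": 0, "spam": 1}` applied to the `target` column.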
### Feature Engineering (`src/feature_engineering.py`)

- Input: Preprocessed train/test data
- Process:
  - Apply TF-IDF vectorization with a configurable `max_features` (default: 500)
  - Transform text into numerical features
- Output: `data/processed/train_tfidf.csv`, `data/processed/test_tfidf.csv`
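The vectorization step maps to scikit-learn's `TfidfVectorizer`; here is a small sketch on a toy corpus (the corpus is illustrative, the `max_features` value mirrors `params.yaml`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["win cash now", "see you at lunch", "win a free prize now"]

# max_features caps the vocabulary at the N most frequent terms;
# the pipeline uses 500, far more than this toy corpus contains
vectorizer = TfidfVectorizer(max_features=500)
X = vectorizer.fit_transform(corpus)   # fit on the training text only
print(X.shape)                         # (n_documents, vocabulary_size)
```

At prediction time only `vectorizer.transform` is applied to the test split, so the test set never influences the vocabulary.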
### Model Building (`src/model_building.py`)

- Input: TF-IDF-transformed features
- Process:
  - Train a Random Forest classifier with configurable hyperparameters (`n_estimators`, `random_state`)
  - Serialize the trained model
- Output: `models/model.pkl`
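A minimal sketch of this stage; the feature matrix is a toy stand-in for `data/processed/train_tfidf.csv`, while the hyperparameters mirror `params.yaml`:

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

# Toy TF-IDF-like features standing in for the real training matrix
X_train = [[0.9, 0.0], [0.8, 0.1], [0.0, 0.7], [0.1, 0.9]]
y_train = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# n_estimators and random_state come from params.yaml (25 and 2)
clf = RandomForestClassifier(n_estimators=25, random_state=2)
clf.fit(X_train, y_train)

# Serialize the trained model the way the pipeline writes models/model.pkl
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)
```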
### Model Evaluation (`src/model_evaluation.py`)

- Input: Trained model and test data
- Process:
  - Generate predictions
  - Calculate metrics (accuracy, precision, recall, AUC)
  - Log metrics with DVCLive
  - Save metrics to JSON
- Output: `reports/metrics.json`, DVCLive artifacts
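The metric computation and JSON export can be sketched as follows; the labels and probabilities are stand-ins for what the real stage gets from `models/model.pkl` on the test split:

```python
import json
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Stand-in predictions; the real stage derives these from the model
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.3, 0.7, 0.2]  # predicted P(spam)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, y_prob),
}

# Persist metrics the way the pipeline writes reports/metrics.json;
# the real stage also logs each value through DVCLive
with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=4)
```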
## Experiment Tracking

This project uses DVCLive for experiment tracking. Each experiment run logs:

- Metrics: Accuracy, Precision, Recall, AUC
- Parameters: All hyperparameters from `params.yaml`
- Plots: Metric evolution across experiments

View experiments:

```bash
# In terminal
dvc exp show
# Or use the DVC extension in VS Code for visualization
```

## Model Performance

Current best model performance on the test set:
| Metric | Score |
|---|---|
| Accuracy | 97.07% |
| Precision | 98.33% |
| Recall | 79.73% |
**Model Details:**
- Algorithm: Random Forest Classifier
- TF-IDF Features: 500
- N Estimators: 25
- Random State: 2
## Technologies Used

- scikit-learn: ML algorithms and preprocessing
- pandas: Data manipulation
- numpy: Numerical computing
- nltk: Natural language processing
- DVC: Data and model versioning, pipeline orchestration
- DVCLive: Experiment tracking and metrics logging
- Git: Code version control
- AWS S3: Remote storage for artifacts
- Python: Core programming language
- PyYAML: Configuration management
- pickle: Model serialization
- logging: Application logging
## Future Enhancements

- Add API endpoint for model inference
- Implement model monitoring and drift detection
- Add more ML algorithms for comparison
- Create Streamlit/Gradio web interface
- Set up CI/CD pipeline with GitHub Actions
- Add unit tests and integration tests
- Implement data validation with Great Expectations
- Add model explainability (LIME/SHAP)
## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Dataset: SMS Spam Collection Dataset
- Thanks to the open-source community
## Support

If you have any questions or need help, please:
- Open an issue on GitHub
- Contact me via email: aadubey1106@gmail.com
⭐ If you find this project helpful, please give it a star!