SMS Spam Detection MLOps Pipeline

A production-ready machine learning pipeline for SMS spam classification, built with MLOps best practices using DVC (Data Version Control), experiment tracking, and automated workflows.


🎯 Overview

This project implements an end-to-end ML pipeline for SMS spam detection using Natural Language Processing (NLP) techniques. The pipeline is fully automated using DVC, allowing for reproducible experiments, version control of data and models, and easy parameter tuning.

The model achieves 97.07% accuracy and 98.33% precision on the test set. The high precision means very few legitimate messages are misclassified as spam; the trade-off is a recall of 79.73%, so some spam still reaches the inbox.

✨ Features

  • Automated ML Pipeline: Complete DVC pipeline from data ingestion to model evaluation
  • Experiment Tracking: Built-in experiment tracking with DVCLive
  • Version Control: Data and model versioning using DVC
  • Cloud Storage: S3 integration for storing pipeline artifacts
  • Comprehensive Logging: Detailed logging at every pipeline stage
  • Parameterized Pipeline: Easy hyperparameter tuning via params.yaml
  • Modular Code: Clean, maintainable, and well-documented codebase

📁 Project Structure

├── .dvc/                      # DVC configuration
│   ├── config                 # DVC remote storage config (S3)
│   └── .gitignore
├── data/
│   ├── raw/                   # Raw train/test splits
│   ├── interm/                # Preprocessed data
│   └── processed/             # Feature-engineered data (TF-IDF)
├── models/
│   └── model.pkl              # Trained model artifact
├── reports/
│   └── metrics.json           # Model evaluation metrics
├── dvclive/                   # DVCLive experiment tracking
│   ├── metrics.json           # Latest metrics
│   ├── params.yaml            # Latest parameters
│   └── plots/                 # Metric plots over experiments
├── experiments/
│   ├── notebook.ipynb         # Jupyter notebook for exploration
│   └── spam.csv               # Data
├── src/
│   ├── data_ingestion.py      # Data loading and splitting
│   ├── data_preprocessing.py  # Text cleaning and encoding
│   ├── feature_engineering.py # TF-IDF vectorization
│   ├── model_building.py      # Model training
│   └── model_evaluation.py    # Model evaluation and tracking
├── logs/                      # Application logs
├── dvc.yaml                   # DVC pipeline definition
├── params.yaml                # Hyperparameters configuration
├── projectflow.txt            # Project workflow documentation
└── README.md                  # Project documentation

🚀 Installation

Prerequisites

  • Python 3.9 or higher
  • Git
  • AWS Account (for S3 storage)
  • pip package manager

Setup Steps

  1. Clone the repository

    git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
    cd YOUR_REPO_NAME
  2. Create a virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Download NLTK data

    python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
  5. Configure AWS credentials (for DVC S3 remote)

    aws configure
    # Enter your AWS Access Key ID, Secret Access Key, and region
  6. Initialize DVC (if not already initialized)

    dvc init
    dvc remote add -d dvcstore s3://your-s3-bucket-name

💻 Usage

Running the Complete Pipeline

Execute the entire pipeline from data ingestion to model evaluation:

dvc repro

Running Individual Stages

You can also run specific pipeline stages:

# Data ingestion only
python src/data_ingestion.py

# Data preprocessing only
python src/data_preprocessing.py

# Feature engineering only
python src/feature_engineering.py

# Model building only
python src/model_building.py

# Model evaluation only
python src/model_evaluation.py

Viewing Pipeline DAG

Visualize the pipeline dependencies:

dvc dag

Experiment Tracking

Run experiments with different parameters:

# Run a new experiment
dvc exp run

# View all experiments
dvc exp show

# Apply a specific experiment
dvc exp apply <experiment-name>

# Remove an experiment
dvc exp remove <experiment-name>

Modifying Parameters

Edit params.yaml to tune hyperparameters:

data_ingestion:
  test_size: 0.3

feature_engineering:
  max_features: 500

model_building:
  n_estimators: 25
  random_state: 2

Then rerun the pipeline:

dvc repro
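
Each stage presumably reads its own section of params.yaml at startup; a minimal loader using PyYAML might look like the sketch below (the function name `load_params` is illustrative, not necessarily the project's actual code):

```python
import yaml

def load_params(path: str = "params.yaml") -> dict:
    """Load all pipeline hyperparameters from a YAML file."""
    with open(path) as f:
        return yaml.safe_load(f)
```

A stage would then index only its own section, e.g. `load_params()["model_building"]["n_estimators"]`.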

🔄 Pipeline Stages

1. Data Ingestion

  • Input: Raw SMS dataset from URL
  • Process:
    • Load data from CSV
    • Drop unnecessary columns
    • Rename columns to target and text
    • Split into train/test sets (70/30 split by default)
  • Output: data/raw/train.csv, data/raw/test.csv
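
A simplified sketch of this stage, assuming the raw CSV uses the `v1`/`v2` column names common to the UCI SMS Spam dataset (the real script may name things differently):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def ingest(df: pd.DataFrame, test_size: float = 0.3, random_state: int = 2):
    """Keep only the label/text columns, rename them, and split train/test."""
    df = df.rename(columns={"v1": "target", "v2": "text"})[["target", "text"]]
    return train_test_split(df, test_size=test_size, random_state=random_state)
```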

2. Data Preprocessing

  • Input: Raw train/test data
  • Process:
    • Label encode target column (ham=0, spam=1)
    • Remove duplicates
    • Text transformation:
      • Lowercase conversion
      • Tokenization
      • Remove non-alphanumeric characters
      • Remove stopwords and punctuation
      • Stemming (Porter Stemmer)
  • Output: data/interm/train_processed.csv, data/interm/test_processed.csv
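
The text transformation steps can be sketched as below. Note two simplifications: the abbreviated stop-word set stands in for NLTK's full English list (which requires `nltk.download('stopwords')`), and `str.split` stands in for NLTK's tokenizer, which the pipeline likely uses given the `punkt` download step:

```python
import string
from nltk.stem.porter import PorterStemmer

# Abbreviated stand-in for NLTK's English stop-word list.
STOPWORDS = {"the", "a", "an", "is", "to", "are", "you", "your", "for", "of", "and"}
stemmer = PorterStemmer()

def transform_text(text: str) -> str:
    """Lowercase, tokenize, drop non-alphanumerics/stopwords/punctuation, stem."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t.isalnum()]
    tokens = [t for t in tokens if t not in STOPWORDS and t not in string.punctuation]
    return " ".join(stemmer.stem(t) for t in tokens)
```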

3. Feature Engineering

  • Input: Preprocessed train/test data
  • Process:
    • Apply TF-IDF vectorization
    • Configure max_features (default: 500)
    • Transform text to numerical features
  • Output: data/processed/train_tfidf.csv, data/processed/test_tfidf.csv
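
A sketch of the vectorization step with scikit-learn. The key detail is fitting the vectorizer on the training split only and reusing that vocabulary for the test split, so the two feature matrices stay aligned:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def build_features(train_text, test_text, max_features: int = 500):
    """Fit TF-IDF on training text only, then transform both splits."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    X_train = vectorizer.fit_transform(train_text)
    X_test = vectorizer.transform(test_text)  # reuse the fitted vocabulary
    return pd.DataFrame(X_train.toarray()), pd.DataFrame(X_test.toarray())
```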

4. Model Building

  • Input: TF-IDF transformed features
  • Process:
    • Train Random Forest Classifier
    • Configurable hyperparameters (n_estimators, random_state)
    • Serialize trained model
  • Output: models/model.pkl
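
The training stage can be sketched as follows (the model is written to `model.pkl` here for brevity; the pipeline stores it under `models/`):

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

def train_model(X_train, y_train, n_estimators: int = 25, random_state: int = 2):
    """Train a Random Forest and serialize it with pickle."""
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    clf.fit(X_train, y_train)
    with open("model.pkl", "wb") as f:
        pickle.dump(clf, f)
    return clf
```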

5. Model Evaluation

  • Input: Trained model and test data
  • Process:
    • Generate predictions
    • Calculate metrics (accuracy, precision, recall, AUC)
    • Log metrics with DVCLive
    • Save metrics to JSON
  • Output: reports/metrics.json, DVCLive artifacts
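
Minus the DVCLive logging, the evaluation logic might look like the sketch below (`evaluate` is an illustrative name, and the JSON is written to the working directory rather than `reports/`):

```python
import json
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def evaluate(clf, X_test, y_test) -> dict:
    """Compute the four reported metrics and persist them as JSON."""
    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:, 1]  # probability of the spam class
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_proba),
    }
    with open("metrics.json", "w") as f:
        json.dump(metrics, f, indent=4)
    return metrics
```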

📊 Experiment Tracking

This project uses DVCLive for experiment tracking. Each experiment run logs:

  • Metrics: Accuracy, Precision, Recall, AUC
  • Parameters: All hyperparameters from params.yaml
  • Plots: Metric evolution across experiments

View experiments:

# In terminal
dvc exp show

# Or use DVC extension in VSCode for visualization

🎯 Model Performance

Current best model performance on test set:

Metric      Score
------      ------
Accuracy    97.07%
Precision   98.33%
Recall      79.73%

Model Details:

  • Algorithm: Random Forest Classifier
  • TF-IDF Features: 500
  • N Estimators: 25
  • Random State: 2

🛠️ Technologies Used

Machine Learning & Data Science

  • scikit-learn: ML algorithms and preprocessing
  • pandas: Data manipulation
  • numpy: Numerical computing
  • nltk: Natural language processing

MLOps & Version Control

  • DVC: Data and model versioning, pipeline orchestration
  • DVCLive: Experiment tracking and metrics logging
  • Git: Code version control
  • AWS S3: Remote storage for artifacts

Development Tools

  • Python: Core programming language
  • PyYAML: Configuration management
  • pickle: Model serialization
  • logging: Application logging

📈 Future Enhancements

  • Add API endpoint for model inference
  • Implement model monitoring and drift detection
  • Add more ML algorithms for comparison
  • Create Streamlit/Gradio web interface
  • Set up CI/CD pipeline with GitHub Actions
  • Add unit tests and integration tests
  • Implement data validation with Great Expectations
  • Add model explainability (LIME/SHAP)

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Support

If you have any questions or need help, please open an issue on the repository.


⭐ If you find this project helpful, please give it a star!
