A production-ready machine learning pipeline for SMS spam classification, built with MLOps best practices using DVC (Data Version Control), experiment tracking, and automated workflows.
## Table of Contents

- Overview
- Features
- Project Structure
- Installation
- Usage
- Pipeline Stages
- Experiment Tracking
- Model Performance
- Technologies Used
- Contributing
- License
## Overview

This project implements an end-to-end ML pipeline for SMS spam detection using Natural Language Processing (NLP) techniques. The pipeline is fully automated with DVC, allowing for reproducible experiments, version control of data and models, and easy parameter tuning.
The model achieves 97.07% accuracy and 98.33% precision on the test set, making it highly effective at identifying spam messages while minimizing false positives.
## Features

- Automated ML Pipeline: Complete DVC pipeline from data ingestion to model evaluation
- Experiment Tracking: Built-in experiment tracking with DVCLive
- Version Control: Data and model versioning using DVC
- Cloud Storage: S3 integration for storing pipeline artifacts
- Comprehensive Logging: Detailed logging at every pipeline stage
- Parameterized Pipeline: Easy hyperparameter tuning via `params.yaml`
- Modular Code: Clean, maintainable, and well-documented codebase
## Project Structure

```
├── .dvc/                      # DVC configuration
│   ├── config                 # DVC remote storage config (S3)
│   └── .gitignore
├── data/
│   ├── raw/                   # Raw train/test splits
│   ├── interm/                # Preprocessed data
│   └── processed/             # Feature-engineered data (TF-IDF)
├── models/
│   └── model.pkl              # Trained model artifact
├── reports/
│   └── metrics.json           # Model evaluation metrics
├── dvclive/                   # DVCLive experiment tracking
│   ├── metrics.json           # Latest metrics
│   ├── params.yaml            # Latest parameters
│   └── plots/                 # Metric plots over experiments
├── experiments/
│   ├── notebook.ipynb         # Jupyter notebook for exploration
│   └── spam.csv               # Data
├── src/
│   ├── data_ingestion.py      # Data loading and splitting
│   ├── data_preprocessing.py  # Text cleaning and encoding
│   ├── feature_engineering.py # TF-IDF vectorization
│   ├── model_building.py      # Model training
│   └── model_evaluation.py    # Model evaluation and tracking
├── logs/                      # Application logs
├── dvc.yaml                   # DVC pipeline definition
├── params.yaml                # Hyperparameters configuration
├── projectflow.txt            # Project workflow documentation
└── README.md                  # Project documentation
```
## Installation

### Prerequisites

- Python 3.9 or higher
- Git
- AWS Account (for S3 storage)
- pip package manager
1. Clone the repository

   ```bash
   git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
   cd YOUR_REPO_NAME
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Download NLTK data

   ```bash
   python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
   ```

5. Configure AWS credentials (for the DVC S3 remote)

   ```bash
   aws configure  # Enter your AWS Access Key ID, Secret Access Key, and region
   ```

6. Initialize DVC (if not already initialized)

   ```bash
   dvc init
   dvc remote add -d dvcstore s3://your-s3-bucket-name
   ```
## Usage

Execute the entire pipeline from data ingestion to model evaluation:

```bash
dvc repro
```

You can also run specific pipeline stages:

```bash
# Data ingestion only
python src/data_ingestion.py

# Data preprocessing only
python src/data_preprocessing.py

# Feature engineering only
python src/feature_engineering.py

# Model building only
python src/model_building.py

# Model evaluation only
python src/model_evaluation.py
```

Visualize the pipeline dependencies:

```bash
dvc dag
```

Run experiments with different parameters:

```bash
# Run a new experiment
dvc exp run

# View all experiments
dvc exp show

# Apply a specific experiment
dvc exp apply <experiment-name>

# Remove an experiment
dvc exp remove <experiment-name>
```

Edit `params.yaml` to tune hyperparameters:
```yaml
data_ingestion:
  test_size: 0.3
feature_engineering:
  max_features: 500
model_building:
  n_estimators: 25
  random_state: 2
```

Then rerun the pipeline:

```bash
dvc repro
```

## Pipeline Stages

### Data Ingestion (`src/data_ingestion.py`)

- Input: Raw SMS dataset from URL
- Process:
  - Load data from CSV
  - Drop unnecessary columns
  - Rename columns to `target` and `text`
  - Split into train/test sets (70/30 split by default)
- Output: `data/raw/train.csv`, `data/raw/test.csv`
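A minimal sketch of this stage, assuming pandas and scikit-learn; the toy DataFrame and the `v1`/`v2` column names stand in for the downloaded `spam.csv`, and the defaults mirror `params.yaml`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def ingest(df: pd.DataFrame, test_size: float = 0.3, random_state: int = 2):
    """Keep the label/text columns, rename them, and split train/test."""
    df = df.iloc[:, :2].copy()          # drop unnecessary columns
    df.columns = ["target", "text"]     # rename to target and text
    return train_test_split(df, test_size=test_size, random_state=random_state)

# Toy data standing in for the real spam.csv download
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam"],
    "v2": ["hi there", "win cash now", "see you soon", "free prize"],
})
train_df, test_df = ingest(raw)
```

In the real stage the resulting frames are written to `data/raw/train.csv` and `data/raw/test.csv`.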
### Data Preprocessing (`src/data_preprocessing.py`)

- Input: Raw train/test data
- Process:
  - Label encode the target column (ham=0, spam=1)
  - Remove duplicates
  - Text transformation:
    - Lowercase conversion
    - Tokenization
    - Remove non-alphanumeric characters
    - Remove stopwords and punctuation
    - Stemming (Porter Stemmer)
- Output: `data/interm/train_processed.csv`, `data/interm/test_processed.csv`
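The text transformation steps above can be sketched in plain Python. The tiny stopword set and `crude_stem` below are illustrative stand-ins for NLTK's stopword corpus and `PorterStemmer`, so the snippet runs without downloading NLTK data:

```python
import re
import string

# Stand-in for NLTK's English stopword corpus (illustrative only)
STOPWORDS = {"the", "a", "an", "is", "to", "you", "i", "and"}

def crude_stem(word: str) -> str:
    # Naive suffix stripping; the real stage uses nltk's PorterStemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def transform_text(text: str) -> str:
    tokens = text.lower().split()                            # lowercase + tokenize
    tokens = [re.sub(r"[^a-z0-9]", "", t) for t in tokens]   # keep alphanumerics
    tokens = [t for t in tokens
              if t and t not in STOPWORDS
              and t not in string.punctuation]               # drop stopwords/punctuation
    return " ".join(crude_stem(t) for t in tokens)           # stem each token

print(transform_text("WINNING a FREE prize, claim now!!"))
```

Label encoding in the real stage is a simple mapping such as `{"ham": 0, "spam": 1}` applied to the `target` column.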
### Feature Engineering (`src/feature_engineering.py`)

- Input: Preprocessed train/test data
- Process:
  - Apply TF-IDF vectorization with a configurable `max_features` (default: 500)
  - Transform text into numerical features
- Output: `data/processed/train_tfidf.csv`, `data/processed/test_tfidf.csv`
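The vectorization step maps to scikit-learn's `TfidfVectorizer`; here is a small sketch on a toy corpus (the corpus is illustrative, the `max_features` value mirrors `params.yaml`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["win cash now", "see you at lunch", "win a free prize now"]

# max_features caps the vocabulary at the N most frequent terms;
# the pipeline uses 500, far more than this toy corpus contains
vectorizer = TfidfVectorizer(max_features=500)
X = vectorizer.fit_transform(corpus)   # fit on the training text only
print(X.shape)                         # (n_documents, vocabulary_size)
```

At prediction time only `vectorizer.transform` is applied to the test split, so the test set never influences the vocabulary.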
### Model Building (`src/model_building.py`)

- Input: TF-IDF-transformed features
- Process:
  - Train a Random Forest classifier with configurable hyperparameters (`n_estimators`, `random_state`)
  - Serialize the trained model
- Output: `models/model.pkl`
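A minimal sketch of this stage; the feature matrix is a toy stand-in for `data/processed/train_tfidf.csv`, while the hyperparameters mirror `params.yaml`:

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

# Toy TF-IDF-like features standing in for the real training matrix
X_train = [[0.9, 0.0], [0.8, 0.1], [0.0, 0.7], [0.1, 0.9]]
y_train = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# n_estimators and random_state come from params.yaml (25 and 2)
clf = RandomForestClassifier(n_estimators=25, random_state=2)
clf.fit(X_train, y_train)

# Serialize the trained model the way the pipeline writes models/model.pkl
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)
```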
### Model Evaluation (`src/model_evaluation.py`)

- Input: Trained model and test data
- Process:
  - Generate predictions
  - Calculate metrics (accuracy, precision, recall, AUC)
  - Log metrics with DVCLive
  - Save metrics to JSON
- Output: `reports/metrics.json`, DVCLive artifacts
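The metric computation and JSON export can be sketched as follows; the labels and probabilities are stand-ins for what the real stage gets from `models/model.pkl` on the test split:

```python
import json
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Stand-in predictions; the real stage derives these from the model
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.3, 0.7, 0.2]  # predicted P(spam)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, y_prob),
}

# Persist metrics the way the pipeline writes reports/metrics.json;
# the real stage also logs each value through DVCLive
with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=4)
```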
## Experiment Tracking

This project uses DVCLive for experiment tracking. Each experiment run logs:

- Metrics: Accuracy, Precision, Recall, AUC
- Parameters: All hyperparameters from `params.yaml`
- Plots: Metric evolution across experiments

View experiments:

```bash
# In terminal
dvc exp show
# Or use the DVC extension in VS Code for visualization
```

## Model Performance

Current best model performance on the test set:
| Metric | Score |
|---|---|
| Accuracy | 97.07% |
| Precision | 98.33% |
| Recall | 79.73% |
**Model Details:**
- Algorithm: Random Forest Classifier
- TF-IDF Features: 500
- N Estimators: 25
- Random State: 2
## Technologies Used

- scikit-learn: ML algorithms and preprocessing
- pandas: Data manipulation
- numpy: Numerical computing
- nltk: Natural language processing
- DVC: Data and model versioning, pipeline orchestration
- DVCLive: Experiment tracking and metrics logging
- Git: Code version control
- AWS S3: Remote storage for artifacts
- Python: Core programming language
- PyYAML: Configuration management
- pickle: Model serialization
- logging: Application logging
## Future Enhancements

- Add API endpoint for model inference
- Implement model monitoring and drift detection
- Add more ML algorithms for comparison
- Create Streamlit/Gradio web interface
- Set up CI/CD pipeline with GitHub Actions
- Add unit tests and integration tests
- Implement data validation with Great Expectations
- Add model explainability (LIME/SHAP)
## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Dataset: SMS Spam Collection Dataset
- Thanks to the open-source community
## Support

If you have any questions or need help, please:
- Open an issue on GitHub
- Contact me via email: aadubey1106@gmail.com
⭐ If you find this project helpful, please give it a star!