- Overview
- Dataset
- Project Structure
- Installation
- Usage
- Web Application
- API Endpoint
- Contributions
- Acknowledgment
This project implements a machine learning pipeline to classify emails as spam or not spam using the Enron email dataset. The pipeline includes data loading, preprocessing, feature extraction, model training, evaluation, and deployment as a web application.
The Enron email dataset is a large, publicly available corpus of real emails suitable for research. It contains approximately 500,000 emails from about 150 users.
- Python 3.7 or higher
- Git
- Clone Repository
git clone https://github.com/hub-mm/spam_email_classifier.git
- Move into Repository
cd spam_email_classifier
- Create Virtual Environment
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate
- Install Required Packages
pip install -r requirements.txt
- Download NLTK data
python utils/setup_nltk.py
- Download the Enron Dataset
Download the dataset from here and extract it into data/processed/raw/maildir/
- Load Emails
python scripts/data_loading.py
This script loads emails from the Enron dataset and saves them to data/processed/emails.csv
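As a rough illustration of what a loading step like this involves, the sketch below walks a maildir-style folder with the standard library, parses each message, and writes one CSV row per email. The function names, the column names (`path`, `subject`, `body`), and the latin-1 decoding are assumptions made for the example, not the contents of scripts/data_loading.py.

```python
import csv
import email
from email import policy
from pathlib import Path


def parse_email(raw_text):
    """Parse one raw message and return its subject and plain-text body."""
    msg = email.message_from_string(raw_text, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return {"subject": str(msg["subject"] or ""), "body": body.strip()}


def load_maildir(root, out_csv):
    """Walk root recursively and write one CSV row per message file."""
    count = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        # Column names are hypothetical, chosen for this sketch.
        writer = csv.DictWriter(fh, fieldnames=["path", "subject", "body"])
        writer.writeheader()
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file():
                continue
            # latin-1 maps every byte to a character, so decoding never fails.
            row = parse_email(path.read_text(encoding="latin-1"))
            row["path"] = str(path)
            writer.writerow(row)
            count += 1
    return count
```

Keep the output CSV outside the folder being walked, otherwise the walk would pick it up as a message.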
- Label Emails
python scripts/data_labelling.py
Labels emails as spam or not spam based on keyword matching and saves the result to data/processed/emails_labelled.csv
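Keyword-based labelling can be sketched in a few lines. The keyword list below is hypothetical and chosen only for illustration; the actual list used by scripts/data_labelling.py is not shown here. Matching on whole words avoids flagging words that merely contain a keyword.

```python
import re

# Hypothetical keyword list, for illustration only.
SPAM_KEYWORDS = ("free", "winner", "viagra", "lottery", "click here")

# One case-insensitive pattern that matches any keyword as a whole word.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SPAM_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def label_email(text):
    """Return 1 (spam) if any keyword occurs as a whole word, else 0."""
    return 1 if _PATTERN.search(text) else 0
```

Labels produced this way are heuristic, so any model trained on them inherits the limits of the keyword rule.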
- Preprocess Emails
python scripts/data_preprocessing.py
Cleans the email text and saves the preprocessed data to data/processed/emails_preprocessed.csv
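A minimal sketch of text cleaning, assuming the usual steps of lowercasing, removing URLs and non-letter characters, and dropping stop words. The exact steps in scripts/data_preprocessing.py may differ, and the small stop-word set below is a stand-in for a fuller list such as the one NLTK provides.

```python
import re

# Small hypothetical stop-word set, used here in place of a full list.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on"}


def clean_text(text):
    """Lowercase, strip URLs and non-letters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```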
- Extract Features
python scripts/extract_features.py
Extracts TF-IDF features from the preprocessed emails and splits the data into training and testing sets.
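The sketch below shows one way to do this step with scikit-learn, assuming that library is what the project uses. It splits first and fits the vectorizer on the training texts only, so that no vocabulary or IDF statistics leak from the test set. The function name, split ratio, and seed are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


def extract_features(texts, labels, test_size=0.25, seed=42):
    """Split the data, then fit TF-IDF on the training texts only."""
    train_txt, test_txt, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_txt)  # learns vocabulary and IDF
    X_test = vectorizer.transform(test_txt)  # reuses the fitted vocabulary
    return X_train, X_test, y_train, y_test, vectorizer
```

The fitted vectorizer is returned because the same vocabulary is needed later to transform new emails at prediction time.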
- Train Model
python scripts/train_model.py
Trains a Multinomial Naive Bayes classifier and saves the model to models/spam_classifier.pkl
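A minimal training sketch with scikit-learn follows. Bundling the TF-IDF vectorizer and the classifier into one pipeline before pickling is a design choice made for this example; whether spam_classifier.pkl stores the vectorizer alongside the classifier is not stated here. The function name is hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train_and_save(texts, labels, model_path):
    """Fit a TF-IDF + Multinomial Naive Bayes pipeline and pickle it."""
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    return model
```

Saving the whole pipeline means a loaded model can call `predict` on raw text directly, with no separate vectorizer file to keep in sync.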
- Evaluate Model
python scripts/evaluate_model.py
Evaluates the trained model on the test set and displays classification metrics and plots.
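The classification metrics can be computed along these lines, assuming scikit-learn and assuming spam is encoded as label 1. The `summarize` helper is hypothetical, and plotting is left out of the sketch.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)


def summarize(y_true, y_pred):
    """Return accuracy, spam-class precision/recall/F1, and the confusion matrix."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1, zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true labels, columns are predicted labels, ordered [0, 1].
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]).tolist(),
    }
```

For spam filtering, precision on the spam class matters in particular, because a false positive sends a legitimate email to the spam folder.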
- Deploy Model
python scripts/deploy_model.py
Runs the Flask web application for classifying emails via a web interface or API.
The web application allows users to input email content and classify it as spam or not spam.
- Home Page
- Classification Result
python scripts/deploy_model.py
Navigate to http://localhost:5000 in your web browser.
- Description: Classifies the given email content.
- Request Body: { "email": "Your email content here" }
- Response: { "spam": true, "probability": 0.95 }
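A small client for the endpoint might look like the sketch below, which uses only the standard library. It assumes the endpoint accepts a JSON POST; the endpoint URL is not given above, so the caller has to pass it in. Both function names are hypothetical.

```python
import json
import urllib.request


def build_request(endpoint_url, email_text):
    """Build a JSON POST request carrying the email text."""
    payload = json.dumps({"email": email_text}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the endpoint expects POST
    )


def classify(endpoint_url, email_text, timeout=10):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_request(endpoint_url, email_text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the Flask application running, `classify(url, "Your email content here")` would be expected to return a dict with the `spam` and `probability` keys shown in the response above.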
Contributions are welcome! Please open an issue or submit a pull request for any bugs, enhancements, or features.