Spam Email Classifier using Machine Learning

Overview

This project implements a machine learning pipeline to classify emails as spam or not spam using the Enron email dataset. The pipeline includes data loading, preprocessing, feature extraction, model training, evaluation, and deployment as a web application.

Dataset

The Enron email dataset is a large corpus of real emails that has been made public and is suitable for research purposes. It contains roughly 500,000 emails from about 150 users.

Project Structure

Top-level layout of the repository:

data/
models/
scripts/
templates/
tests/
utils/
.gitignore
README.md
requirements.txt

Installation

Prerequisites

Python 3.7 or higher
Git

Steps

Clone Repository

    git clone https://github.com/hub-mm/spam_email_classifier.git

Move into Repository
```
cd spam_email_classifier
```

Create Virtual Environment

    python -m venv venv
    source venv/bin/activate
    # On Windows, activate with this command instead:
    # venv\Scripts\activate

Install Required Packages
```
pip install -r requirements.txt
```
Download NLTK data
```
python utils/setup_nltk.py
```

Download the Enron Dataset

Download the Enron email dataset and extract it into data/processed/raw/maildir/

Usage

Data Preparation

Load Emails
```
python scripts/data_loading.py
```

This script loads emails from the Enron dataset and saves them to data/processed/emails.csv
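
The loading step can be sketched with the standard library alone. This is a minimal illustration, not the contents of scripts/data_loading.py: the function names `iter_emails` and `write_csv`, the column names (`path`, `subject`, `body`), and the lenient latin-1 decoding are all assumptions made for the example.

```python
import csv
import email
from pathlib import Path


def extract_body(msg):
    """Return the plain-text body of a parsed message as a string."""
    if not msg.is_multipart():
        return msg.get_payload()
    # For multipart messages, keep only the text/plain leaf parts.
    parts = [
        part.get_payload()
        for part in msg.walk()
        if not part.is_multipart() and part.get_content_type() == "text/plain"
    ]
    return "\n".join(parts)


def iter_emails(maildir):
    """Yield (path, subject, body) for every file below maildir."""
    for path in sorted(Path(maildir).rglob("*")):
        if not path.is_file():
            continue
        # Assumption: decode leniently, since old mail is not reliably UTF-8.
        raw = path.read_text(encoding="latin-1")
        msg = email.message_from_string(raw)
        yield str(path), msg["subject"] or "", extract_body(msg)


def write_csv(maildir, out_path):
    """Write all parsed emails to a CSV file and return the row count."""
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "subject", "body"])  # hypothetical columns
        for row in iter_emails(maildir):
            writer.writerow(row)
            count += 1
    return count
```

Calling `write_csv("data/processed/raw/maildir", "data/processed/emails.csv")` would mirror the paths named in this README.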

Label Emails
```
python scripts/data_labelling.py
```

Labels emails as spam or not spam based on keyword matching and saves the result to data/processed/emails_labelled.csv
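
Keyword-based labelling might look roughly like the sketch below. The keyword list and the function name `label_email` are hypothetical and chosen only for illustration; the actual list used by scripts/data_labelling.py is not given in this README.

```python
import re

# Hypothetical keyword list, for illustration only.
SPAM_KEYWORDS = ("free", "winner", "lottery", "click here")

# One case-insensitive pattern that matches any keyword as a whole word.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SPAM_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def label_email(text):
    """Return 1 (spam) if any keyword occurs as a whole word, else 0."""
    return 1 if _PATTERN.search(text) else 0
```

Labels produced this way are heuristic, so a classifier trained on them learns to approximate the keyword rule rather than a human judgement of spam.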

Preprocess Emails
```
python scripts/data_preprocessing.py
```

Cleans the email text and saves the preprocessed data to data/processed/emails_preprocessed.csv
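
As a rough idea of what cleaning can involve, the sketch below lowercases the text, strips URLs, email addresses, and non-letter characters, then drops stop words. The stop-word list here is a tiny stand-in and `clean_text` is a hypothetical name; since the setup step downloads NLTK data, the project's own script may well rely on NLTK resources instead.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on"}
)


def clean_text(text):
    """Normalise raw email text into a space-separated token string."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"\S+@\S+", " ", text)                # drop email addresses
    text = re.sub(r"[^a-z\s]", " ", text)               # keep letters only
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)
```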

Model Training

Extract Features
```
python scripts/extract_features.py
```

Extracts TF-IDF features from the preprocessed emails and splits the data into training and testing sets.
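
To make the TF-IDF weighting concrete, here is a plain-Python sketch of one common variant: smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document vector. It is an illustration of the idea, not the project's implementation, which may use a library vectorizer with different settings. The `split` helper, its 20% test fraction, and the seed are likewise assumptions.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows); each row maps term -> TF-IDF weight."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        # L2-normalise so long emails do not dominate short ones.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return sorted(idf), rows


def split(items, labels, test_fraction=0.2, seed=42):
    """Shuffle indices reproducibly and split into train and test parts."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(round(len(idx) * test_fraction)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    x_train = [items[i] for i in train_idx]
    x_test = [items[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test
```

Terms that appear in few documents receive a larger weight than terms that appear everywhere, which is what lets rare spam vocabulary stand out.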

Train Model
```
python scripts/train_model.py
```

Trains a Multinomial Naive Bayes classifier and saves the model to models/spam_classifier.pkl

Model Evaluation

    python scripts/evaluate_model.py

Evaluates the trained model on the test set and displays classification metrics and plots.
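
The README does not list which classification metrics are reported. As an illustration of the usual ones, the sketch below computes accuracy, precision, recall, and F1 from true and predicted labels, assuming spam is encoded as 1. The function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For spam filtering, precision is often watched most closely, since a legitimate email marked as spam is usually costlier than a spam email that slips through.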

Deployment

    python scripts/deploy_model.py

Runs the Flask web application for classifying emails via a web interface or API.

Web Application

The web application allows users to input email content and classify it as spam or not spam.

Home Page
Classification Result

Running the App

    python scripts/deploy_model.py

Navigate to http://localhost:5000 in your web browser.

API Endpoint

/predict (POST)

Description: Classifies the given email content.
Request Body: { "email": "Your email content here" }
Response: { "spam": true, "probability": 0.95 }
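
A client call to this endpoint could be sketched with the standard library as follows. It assumes the app is running locally on port 5000 and accepts a JSON body sent with a `Content-Type: application/json` header; the helper names `build_request` and `classify` are hypothetical.

```python
import json
import urllib.request

# Assumes the app is running locally, as described under Running the App.
API_URL = "http://localhost:5000/predict"


def build_request(email_text, url=API_URL):
    """Build a POST request carrying the email text as a JSON body."""
    payload = json.dumps({"email": email_text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(email_text, url=API_URL, timeout=10):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_request(email_text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `classify("Your email content here")` should return a dictionary with the `spam` and `probability` keys shown in the response above.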

Contributions

Contributions are welcome! Please open an issue or submit a pull request for any bugs, enhancements, or features.
