Spam email classifier covering data loading, preprocessing, feature extraction, model training, and evaluation, deployed as a basic Flask application. Dataset used: Enron emails.

hub-mm/spam_email_classifier

Spam Email Classifier using Machine Learning

Overview

This project implements a machine learning pipeline to classify emails as spam or not spam using the Enron email dataset. The pipeline includes data loading, preprocessing, feature extraction, model training, evaluation, and deployment as a web application.

Dataset

The Enron email dataset is a large public corpus of real emails that is widely used for research. It contains approximately 500,000 emails from about 150 users.

Project Structure

Installation

Prerequisites

  • Python 3.7 or higher
  • Git

Steps

  1. Clone Repository
    git clone https://github.com/hub-mm/spam_email_classifier.git
  2. Move into Repository
    cd spam_email_classifier
  3. Create Virtual Environment
    python -m venv venv
    source venv/bin/activate
    # On Windows use:
    venv\Scripts\activate
  4. Install Required Packages
    pip install -r requirements.txt
  5. Download NLTK data
    python utils/setup_nltk.py
  6. Download the Enron Dataset
    Download the dataset from here and extract it into data/processed/raw/maildir/

Usage

Data Preparation

  1. Load Emails
    python scripts/data_loading.py

This script loads emails from the Enron dataset and saves them to data/processed/emails.csv
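As a rough illustration of what this loading step involves, the sketch below walks a maildir tree, parses each message with the standard library, and writes a CSV. The function names `load_emails` and `save_emails` and the column names are assumptions made for this example; the actual `scripts/data_loading.py` may be organised differently.

```python
import csv
from email import message_from_string
from pathlib import Path
from typing import Dict, List


def load_emails(maildir: Path) -> List[Dict[str, str]]:
    """Parse every file under maildir into a dict with file, subject and body."""
    rows = []
    for path in sorted(p for p in maildir.rglob("*") if p.is_file()):
        # Enron messages are not consistently UTF-8, so decode leniently.
        raw = path.read_text(encoding="latin-1")
        msg = message_from_string(raw)
        body = msg.get_payload()
        if isinstance(body, list):
            # Multipart message: join the parts into one text blob.
            body = "\n".join(str(part.get_payload()) for part in body)
        rows.append({
            "file": str(path.relative_to(maildir)),
            "subject": str(msg.get("Subject", "")),
            "body": body,
        })
    return rows


def save_emails(rows: List[Dict[str, str]], out_csv: Path) -> None:
    """Write the parsed rows to a CSV file, creating parent folders if needed."""
    out_csv.parent.mkdir(parents=True, exist_ok=True)
    with out_csv.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file", "subject", "body"])
        writer.writeheader()
        writer.writerows(rows)
```

With these helpers, `save_emails(load_emails(Path("data/processed/raw/maildir")), Path("data/processed/emails.csv"))` would reproduce the paths mentioned above.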

  2. Label Emails
    python scripts/data_labelling.py

Labels emails as spam or not spam based on keyword matching and saves the result to data/processed/emails_labelled.csv
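Keyword-based labelling can be sketched as follows. The keyword list here is purely hypothetical, since the project's actual keywords are not listed in this README, and `label_email` is a name chosen for the example.

```python
import re

# Hypothetical keyword list; the project's real keywords are not given here.
SPAM_KEYWORDS = ("free", "winner", "click here", "lottery", "prize")

# One case-insensitive pattern that matches any keyword as a whole word or phrase.
_SPAM_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SPAM_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def label_email(text: str) -> int:
    """Return 1 (spam) if any keyword occurs in the text, else 0 (not spam)."""
    return int(bool(_SPAM_PATTERN.search(text)))
```

Labels produced this way are heuristic, so a model trained on them largely learns to reproduce the keyword rule rather than an independent notion of spam.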

  3. Preprocess Emails
    python scripts/data_preprocessing.py

Cleans the email text and saves the preprocessed data to data/processed/emails_preprocessed.csv
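A minimal sketch of text cleaning is shown below. The installation steps suggest the project relies on NLTK data; to keep the example dependency-free it uses regular expressions and a tiny illustrative stop-word set instead, so both the rules and the name `clean_text` are assumptions.

```python
import re

# Tiny illustrative stop-word set; a real pipeline would use a fuller list
# (for example the one shipped with NLTK).
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "for"}


def clean_text(text: str) -> str:
    """Lowercase, strip URLs, email addresses and non-letters, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\S+@\S+", " ", text)                # remove email addresses
    text = re.sub(r"[^a-z\s]", " ", text)               # keep letters only
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(clean_text("Click HERE: http://spam.example.com to WIN $1000 for the team!"))
# prints "click here win team"
```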

Model Training

  1. Extract Features
    python scripts/extract_features.py

Extracts TF-IDF features from the preprocessed emails and splits the data into training and testing sets.
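The feature-extraction step might look like the sketch below, which assumes scikit-learn is among the installed requirements. It splits before fitting the vectorizer so that test-set vocabulary does not influence the TF-IDF weights; this is a design choice of the example and may differ from the order used in `scripts/extract_features.py`. The 80/20 split and the seed are also assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


def extract_features(texts, labels, test_size=0.2, seed=42):
    """Split the corpus, then fit TF-IDF on the training part only."""
    train_txt, test_txt, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_txt)  # learn vocabulary and IDF
    X_test = vectorizer.transform(test_txt)        # reuse the fitted vocabulary
    return vectorizer, X_train, X_test, y_train, y_test
```

The fitted vectorizer is returned alongside the matrices because new emails must be transformed with the same vocabulary at prediction time.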

  2. Train Model
    python scripts/train_model.py

Trains a Multinomial Naive Bayes classifier and saves the model to models/spam_classifier.pkl
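Training and saving the classifier can be sketched with scikit-learn's `MultinomialNB` and the standard `pickle` module. The function name `train_and_save` and the use of default hyperparameters are assumptions for illustration.

```python
import pickle
from pathlib import Path

from sklearn.naive_bayes import MultinomialNB


def train_and_save(X_train, y_train, model_path):
    """Fit a Multinomial Naive Bayes model and pickle it to model_path."""
    model = MultinomialNB()  # default smoothing (alpha=1.0)
    model.fit(X_train, y_train)
    model_path = Path(model_path)
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with model_path.open("wb") as fh:
        pickle.dump(model, fh)
    return model
```

Calling `train_and_save(X_train, y_train, "models/spam_classifier.pkl")` would write to the path named above. Only unpickle model files you trust, since `pickle.load` can execute arbitrary code.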

Model Evaluation

    python scripts/evaluate_model.py

Evaluates the trained model on the test set and displays classification metrics and plots.
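For the text metrics, a sketch using `sklearn.metrics` is given below; plotting is left out. The label encoding (0 for not spam, 1 for spam) and the helper name `evaluate` are assumptions.

```python
from sklearn.metrics import classification_report, confusion_matrix


def evaluate(y_true, y_pred):
    """Return a per-class precision/recall/F1 report and a 2x2 confusion matrix."""
    report = classification_report(
        y_true,
        y_pred,
        labels=[0, 1],
        target_names=["not spam", "spam"],
        zero_division=0,
    )
    # Rows are true classes, columns are predicted classes, in the order [0, 1].
    matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return report, matrix
```

In the confusion matrix, the top-right cell counts legitimate emails flagged as spam, which is usually the costlier error for a mail filter.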

Deployment

    python scripts/deploy_model.py

Runs the Flask web application for classifying emails via a web interface or API.

Web Application

The web application allows users to input email content and classify it as spam or not spam.

  • Home Page
  • Classification Result

Running the App

    python scripts/deploy_model.py

Navigate to http://localhost:5000 in your web browser.

API Endpoint

/predict (POST)

  • Description: Classifies the given email content.
  • Request Body: { "email": "Your email content here" }
  • Response: { "spam": true, "probability": 0.95 }
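A client call to this endpoint could be sketched with the standard library as follows. It assumes the app is running locally on port 5000 and that `/predict` accepts a JSON body; the helper names are made up for the example.

```python
import json
import urllib.request


def build_predict_request(email_text, base_url="http://localhost:5000"):
    """Build a POST request for /predict carrying the email text as JSON."""
    payload = json.dumps({"email": email_text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(email_text, base_url="http://localhost:5000"):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(email_text, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `classify("You have won a free prize")` should return a dict with the `spam` and `probability` keys described above.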

Contributions

Contributions are welcome! Please open an issue or submit a pull request for any bugs, enhancements, or features.

Acknowledgement
