NLP Sentiment Analysis Pipeline


An end-to-end machine learning pipeline for binary sentiment classification, built on a soft-voting ensemble and N-gram TF-IDF vectorization.

Engineering Goal: Maximize classification accuracy on movie reviews by combining the strengths of linear models (Logistic Regression, SVM) and tree-based models (Random Forest) via soft voting.

Academic Context: Final Project for the "Introduction to Data Science" course at Institut Teknologi Sepuluh Nopember (ITS), Indonesia (2025). Ranked via a private Kaggle competition.


🏗 Architecture & Methodology

This project implements a pipeline designed for high-dimensional, sparse text features. Unlike a single-model baseline, it combines N-gram TF-IDF feature extraction with a soft-voting ensemble of three classifiers.

Key Steps:

  1. Advanced Preprocessing (src/preprocess.py):
    • Noise Removal: Regex-based cleaning to strip URLs, user handles (@user), and non-alphabetic characters.
    • Normalization: Lowercasing and whitespace trimming.
  2. Feature Extraction:
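    • TF-IDF Vectorization: configured with ngram_range=(1, 3), i.e. unigrams through trigrams, to capture phrase-level context (e.g., "not good").
    • Optimization: Uses sublinear_tf scaling and strict document frequency limits (min_df=3, max_df=0.7), which drop terms that appear in fewer than 3 documents or in more than 70% of documents, to reduce feature noise.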
  3. Ensemble Modeling (src/model.py):
    • Voting Classifier (Soft Vote): Averages the class probabilities predicted by three distinct models and picks the class with the highest average:
      • Logistic Regression: For linear separability and speed.
      • Random Forest: To capture non-linear relationships.
      • SVM (RBF/Linear): For high-dimensional margin maximization.
  4. Evaluation:
    • 5-Fold Cross-Validation to ensure model stability.
    • Metrics: Accuracy, Precision, Recall, and F1-Score.
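
The steps above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the code in src/preprocess.py or src/model.py: the function names clean_text and build_pipeline, the regular expressions, the toy corpus, and the classifier hyperparameters (max_iter, n_estimators, the linear SVM kernel) are all assumptions. Only the TF-IDF settings and the choice of three soft-voted models come from the description above.

```python
import re

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def clean_text(text: str) -> str:
    """Strip URLs, @handles, and non-alphabetic characters, then normalize."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                    # user handles
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # non-alphabetic characters
    return re.sub(r"\s+", " ", text).strip().lower()     # trim whitespace, lowercase


def build_pipeline() -> Pipeline:
    """TF-IDF (1- to 3-grams) feeding a soft-voting ensemble (hypothetical helper)."""
    vectorizer = TfidfVectorizer(
        preprocessor=clean_text,  # replaces the default lowercasing step
        ngram_range=(1, 3),
        sublinear_tf=True,
        min_df=3,
        max_df=0.7,
    )
    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
            # Soft voting needs predict_proba, so the SVM must enable probability.
            ("svm", SVC(kernel="linear", probability=True, random_state=42)),
        ],
        voting="soft",
    )
    return Pipeline([("tfidf", vectorizer), ("ensemble", ensemble)])


# Toy corpus (made up for illustration) with enough repetition to survive min_df=3.
positive = [f"Really great film, loved it! Take {i} @critic{i}" for i in range(10)]
negative = [f"Not good at all, boring plot. Take {i} https://example.com/{i}" for i in range(10)]
texts = positive + negative
labels = [1] * 10 + [0] * 10

model = build_pipeline().fit(texts, labels)

samples = ["Really great film, loved it!", "Not good at all, boring plot."]
predictions = model.predict(samples).tolist()
probabilities = model.predict_proba(samples)  # averaged class probabilities
print(predictions)
print(probabilities.round(3))
```

Passing the cleaning function as the vectorizer's preprocessor keeps cleaning and vectorization in one object, so the same transformation is applied at training and inference time. Whether the repository wires it this way is not stated in this README.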

📂 Project Structure

├── data/
│   ├── train.csv          # Labelled training dataset
│   └── test.csv           # Unlabelled testing dataset
├── notebooks/
│   └── eda.ipynb          # Exploratory Analysis (Class balance, Text length distribution)
├── src/
│   ├── preprocess.py      # Regex cleaning & TF-IDF N-gram vectorizer
│   ├── model.py           # Ensemble Voting Classifier definition
│   └── predict.py         # Inference logic for generating submissions
├── outputs/
│   └── submission.csv     # Final predictions generated by the pipeline
├── main.py                # Entry point orchestrating the full flow
└── requirements.txt       # Python dependencies

🚀 Getting Started

Prerequisites

  • Python 3.8+
  • Key Libraries: scikit-learn, pandas, nltk, textblob

Installation

git clone https://github.com/noecrn/NLP-Sentiment-Analysis-Pipeline.git
cd NLP-Sentiment-Analysis-Pipeline
pip install -r requirements.txt

Usage

Run the full pipeline (Clean → Vectorize → Train Ensemble → Predict):

python main.py

This script will:

  1. Load and clean the training data.
  2. Train the Voting Classifier and print Cross-Validation accuracy.
  3. Generate predictions for test.csv and save them to outputs/submission.csv.
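
As a rough sketch of the last step, the snippet below predicts labels for a test set and writes a submission file. Everything specific here is an assumption: the column names (id, text, sentiment), the helper write_submission, and the in-memory DataFrames that stand in for data/train.csv and data/test.csv. A plain TF-IDF plus Logistic Regression model stands in for the ensemble to keep the example short, and the output goes to a temporary directory rather than outputs/.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def write_submission(model, test_df: pd.DataFrame, out_path: Path) -> pd.DataFrame:
    """Predict a label for each test row and save an id/label CSV (hypothetical helper)."""
    submission = pd.DataFrame(
        {"id": test_df["id"], "sentiment": model.predict(test_df["text"])}
    )
    out_path.parent.mkdir(parents=True, exist_ok=True)
    submission.to_csv(out_path, index=False)
    return submission


# In-memory stand-ins for data/train.csv and data/test.csv (column names assumed).
train_df = pd.DataFrame(
    {
        "text": ["great film", "loved this movie", "boring plot", "not good at all"],
        "sentiment": [1, 1, 0, 0],
    }
)
test_df = pd.DataFrame({"id": [101, 102], "text": ["great movie", "boring film"]})

# Stand-in model; the project itself describes a soft-voting ensemble.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_df["text"], train_df["sentiment"])

out_path = Path(tempfile.mkdtemp()) / "outputs" / "submission.csv"
submission = write_submission(model, test_df, out_path)
print(out_path.read_text())
```

The exact submission format depends on the Kaggle competition's requirements, which this README does not specify.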

📊 Performance

  • Vectorization: TF-IDF (Unigrams + Bigrams + Trigrams)
  • Validation Strategy: 5-Fold Cross-Validation
  • Target Metric: Accuracy
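
The validation strategy above could be reproduced along these lines. This is a sketch under stated assumptions: the toy corpus is made up, Logistic Regression stands in for the full ensemble, and the shuffling seed is arbitrary. It reports accuracy alongside the other metrics listed under Evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Made-up corpus: 10 positive and 10 negative reviews.
texts = [f"great film loved it number {i}" for i in range(10)] + [
    f"boring plot not good number {i}" for i in range(10)
]
labels = [1] * 10 + [0] * 10

# Vectorizer and classifier live in one pipeline so that TF-IDF statistics
# are fit on the training folds only, never on the held-out fold.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), sublinear_tf=True),
    LogisticRegression(max_iter=1000),  # stand-in for the voting ensemble
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(model, texts, labels, cv=cv, scoring=metrics)

for name in metrics:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Stratified folds keep the class ratio the same in every split, which matters when the classes are imbalanced.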

👤 Authors

  • Noé Cornu - Engineering Student @ EPITA - GitHub | LinkedIn
  • Baptiste Rio - Engineering Student @ EPITA
