NLP Sentiment Analysis Pipeline

An end-to-end machine learning pipeline for binary sentiment classification, built on a soft-voting ensemble and N-gram TF-IDF vectorization.

Engineering Goal: Maximize classification accuracy on movie reviews by combining the strengths of linear models (Logistic Regression, SVM) and tree-based models (Random Forest) via soft voting.

Academic Context: Final Project for the "Introduction to Data Science" course at Institut Teknologi Sepuluh Nopember (ITS), Indonesia (2025). Ranked via a private Kaggle competition.

🏗 Architecture & Methodology

This project implements a pipeline designed for high-dimensional, sparse text data. Rather than relying on a single baseline classifier, it combines N-gram TF-IDF features with a soft-voting ensemble of three models.

Key Steps:

Advanced Preprocessing (src/preprocess.py):
- Noise Removal: Regex-based cleaning to strip URLs, user handles (@user), and non-alphabetic characters.
- Normalization: Lowercasing and whitespace trimming.
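The cleaning steps above could look roughly like the following. This is a minimal sketch, not the contents of src/preprocess.py: the function name `clean_text` and the exact regular expressions are assumptions chosen for illustration.

```python
import re

# Hypothetical patterns; the expressions used in src/preprocess.py may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase, strip URLs and @handles, drop non-letters, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove URLs before handles and punctuation
    text = HANDLE_RE.sub(" ", text)     # remove user handles such as @user
    text = NON_ALPHA_RE.sub(" ", text)  # keep only a-z and whitespace
    return WHITESPACE_RE.sub(" ", text).strip()
```

For example, `clean_text("  @user LOVED it!! 10/10 https://example.com/review  ")` returns `"loved it"`. Note that stripping all non-alphabetic characters also splits contractions ("don't" becomes "don t"), which is one reason the N-gram features matter downstream.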
Feature Extraction:
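- TF-IDF Vectorization: Configured with ngram_range=(1, 3), i.e. unigrams through trigrams, to capture phrase-level context (e.g., "not good").
- Optimization: Uses sublinear_tf scaling and document frequency limits (min_df=3, max_df=0.7), which drop terms appearing in fewer than 3 documents or in more than 70% of documents, to reduce feature noise.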
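Assuming the vectorizer is scikit-learn's `TfidfVectorizer`, the settings listed above map onto its constructor as shown below. The helper name `build_vectorizer` and the six-document corpus are invented for illustration; only the four parameter values come from this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def build_vectorizer() -> TfidfVectorizer:
    """TF-IDF over unigrams, bigrams and trigrams with the README's frequency limits."""
    return TfidfVectorizer(
        ngram_range=(1, 3),  # unigrams through trigrams
        sublinear_tf=True,   # replace tf with 1 + log(tf)
        min_df=3,            # drop terms seen in fewer than 3 documents
        max_df=0.7,          # drop terms seen in more than 70% of documents
    )


# Toy corpus (invented) to show the effect of the frequency limits.
corpus = [
    "the film was not good",
    "not good at all",
    "honestly not good",
    "a great film",
    "great acting and great story",
    "simply great",
]

vectorizer = build_vectorizer()
X = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_))  # ['good', 'great', 'not', 'not good']
```

On this toy corpus the bigram "not good" survives because it occurs in three documents, while "film" (two documents) is filtered out by min_df=3.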
Ensemble Modeling (src/model.py):
- Voting Classifier (Soft Vote): Combines probability predictions from three distinct models:
  - Logistic Regression: For linear separability and speed.
  - Random Forest: To capture non-linear relationships.
  - SVM (RBF or linear kernel): For margin maximization in a high-dimensional feature space.
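A soft-voting ensemble of these three model families can be sketched with scikit-learn's `VotingClassifier`. This is an illustration under assumptions, not the definition in src/model.py: the function name `build_ensemble`, the hyperparameters, and the choice of a linear kernel are all placeholders. One detail that is certain: soft voting averages `predict_proba` outputs, so `SVC` has to be created with `probability=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


def build_ensemble(random_state: int = 42) -> VotingClassifier:
    """Soft-voting ensemble of logistic regression, random forest and SVM."""
    estimators = [
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=200, random_state=random_state)),
        # Soft voting averages predict_proba outputs, so SVC needs probability=True.
        ("svm", SVC(kernel="linear", probability=True, random_state=random_state)),
    ]
    return VotingClassifier(estimators=estimators, voting="soft")


# Synthetic numeric features stand in for the TF-IDF matrix here.
X, y = make_classification(n_samples=60, n_features=10, random_state=0)
ensemble = build_ensemble().fit(X, y)
proba = ensemble.predict_proba(X)  # averaged class probabilities, one row per sample
```

With `voting="soft"`, the predicted class is the one with the highest mean probability across the three models, so a confident model can outvote two uncertain ones.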
Evaluation:
- 5-Fold Cross-Validation to check that performance is stable across different train/validation splits.
- Metrics: Accuracy, Precision, Recall, and F1-Score.
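The evaluation step could be reproduced along these lines with scikit-learn's `cross_validate`. The corpus is an invented toy example and logistic regression stands in for the full ensemble to keep the sketch fast. Wrapping the vectorizer and classifier in one pipeline is a design choice assumed here: it makes TF-IDF statistics get re-fitted on each training fold, so no information from the held-out fold leaks into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Toy corpus (invented): 20 positive and 20 negative reviews.
texts = [f"great film loved scene {i}" for i in range(20)] + [
    f"not good hated scene {i}" for i in range(20)
]
labels = [1] * 20 + [0] * 20

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), sublinear_tf=True),
    LogisticRegression(max_iter=1000),  # stand-in for the voting ensemble
)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(model, texts, labels, cv=cv, scoring=metrics)

for name in metrics:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

The per-fold standard deviation is the quantity to watch: a large spread suggests the score depends heavily on which reviews land in the validation fold.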

📂 Project Structure

├── data/
│   ├── train.csv          # Labelled training dataset
│   └── test.csv           # Unlabelled testing dataset
├── notebooks/
│   └── eda.ipynb          # Exploratory Analysis (Class balance, Text length distribution)
├── src/
│   ├── preprocess.py      # Regex cleaning & TF-IDF N-gram vectorizer
│   ├── model.py           # Ensemble Voting Classifier definition
│   └── predict.py         # Inference logic for generating submissions
├── outputs/
│   └── submission.csv     # Final predictions generated by the pipeline
├── main.py                # Entry point orchestrating the full flow
└── requirements.txt       # Python dependencies

🚀 Getting Started

Prerequisites

Python 3.8+
Key Libraries: scikit-learn, pandas, nltk, textblob

Installation

git clone https://github.com/noecrn/NLP-Sentiment-Analysis-Pipeline.git
cd NLP-Sentiment-Analysis-Pipeline
pip install -r requirements.txt

Usage

Run the full pipeline (Clean → Vectorize → Train Ensemble → Predict):

python main.py

This script will:

Load and clean the training data.
Train the Voting Classifier and print Cross-Validation accuracy.
Generate predictions for test.csv and save them to outputs/submission.csv.
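The last step, writing predictions to outputs/submission.csv, might be implemented roughly as follows. The function name `write_submission` and the column names `id`, `text`, and `sentiment` are hypothetical; the real schema of test.csv and the format expected by the Kaggle competition are not documented here.

```python
from pathlib import Path

import pandas as pd


def write_submission(
    model,
    test_df: pd.DataFrame,
    out_path: str = "outputs/submission.csv",
    id_col: str = "id",            # hypothetical column name
    text_col: str = "text",        # hypothetical column name
    label_col: str = "sentiment",  # hypothetical column name
) -> pd.DataFrame:
    """Predict a label for every test row and save an (id, label) CSV."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create outputs/ if missing
    submission = pd.DataFrame(
        {id_col: test_df[id_col], label_col: model.predict(test_df[text_col])}
    )
    submission.to_csv(path, index=False)
    return submission
```

Here `model` is any fitted object with a `predict` method that accepts raw text, such as a pipeline chaining the vectorizer and the voting classifier.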

📊 Performance

Vectorization: TF-IDF (Unigrams + Bigrams + Trigrams)
Validation Strategy: 5-Fold Cross-Validation
Target Metric: Accuracy

👤 Authors

Noé Cornu - Engineering Student @ EPITA - GitHub | LinkedIn
Baptiste Rio - Engineering Student @ EPITA
