An end-to-end Machine Learning pipeline for binary sentiment classification, utilizing an advanced Ensemble Voting strategy and N-gram TF-IDF vectorization.
Engineering Goal: Maximize classification accuracy on movie reviews by combining the strengths of linear models (Logistic Regression, SVM) and tree-based models (Random Forest) via soft voting.
Academic Context: Final Project for the "Introduction to Data Science" course at Institut Teknologi Sepuluh Nopember (ITS), Indonesia (2025). Ranked via a private Kaggle competition.
This project implements a robust pipeline designed for high-dimensional text data. Unlike simple baselines, it uses advanced feature extraction and a soft-voting model ensemble.
- Advanced Preprocessing (`src/preprocess.py`):
  - Noise Removal: Regex-based cleaning to strip URLs, user handles (`@user`), and non-alphabetic characters.
  - Normalization: Lowercasing and whitespace trimming.
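A minimal sketch of this cleaning step (the function name `clean_text` and exact regex patterns are illustrative, not necessarily those in `src/preprocess.py`):

```python
import re

def clean_text(text: str) -> str:
    """Strip URLs, @user handles, and non-alphabetic characters, then normalize."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove user handles
    text = re.sub(r"[^a-zA-Z\s]", " ", text)            # keep letters and spaces only
    return re.sub(r"\s+", " ", text).lower().strip()    # lowercase + trim whitespace
```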
- Feature Extraction:
  - TF-IDF Vectorization: configured with N-grams (1, 3) to capture phrase-level context (e.g., "not good").
  - Optimization: Uses `sublinear_tf` scaling and strict document frequency limits (`min_df=3`, `max_df=0.7`) to reduce feature noise.
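The vectorizer configuration described above corresponds roughly to this scikit-learn setup (a sketch, not the exact code in `src/preprocess.py`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 3),  # unigrams, bigrams, trigrams (captures "not good")
    sublinear_tf=True,   # 1 + log(tf) dampens very frequent terms
    min_df=3,            # drop terms appearing in fewer than 3 documents
    max_df=0.7,          # drop terms appearing in more than 70% of documents
)
```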
- Ensemble Modeling (`src/model.py`):
  - Voting Classifier (Soft Vote): Combines probability predictions from three distinct models:
    - Logistic Regression: For linear separability and speed.
    - Random Forest: To capture non-linear relationships.
    - SVM (RBF/Linear): For high-dimensional margin maximization.
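The soft-voting ensemble can be sketched as follows (hyperparameters are illustrative; note that `SVC` needs `probability=True` so it can contribute class probabilities to the soft vote):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("svm", SVC(kernel="linear", probability=True, random_state=42)),
    ],
    voting="soft",  # average the predicted class probabilities across models
)
```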
- Evaluation:
  - 5-Fold Cross-Validation to ensure model stability.
  - Metrics: Accuracy, Precision, Recall, and F1-Score.
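The evaluation step amounts to roughly the following (a self-contained sketch using toy data and a single classifier as a stand-in for the full ensemble):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, random_state=0)  # toy stand-in data
clf = LogisticRegression(max_iter=1000)  # stand-in for the voting ensemble

# 5-fold CV over the four reported metrics
scores = cross_validate(clf, X, y, cv=5,
                        scoring=["accuracy", "precision", "recall", "f1"])
print(f"CV accuracy: {scores['test_accuracy'].mean():.4f}")
```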
```
├── data/
│   ├── train.csv         # Labelled training dataset
│   └── test.csv          # Unlabelled testing dataset
├── notebooks/
│   └── eda.ipynb         # Exploratory Analysis (class balance, text length distribution)
├── src/
│   ├── preprocess.py     # Regex cleaning & TF-IDF N-gram vectorizer
│   ├── model.py          # Ensemble Voting Classifier definition
│   └── predict.py        # Inference logic for generating submissions
├── outputs/
│   └── submission.csv    # Final predictions generated by the pipeline
├── main.py               # Entry point orchestrating the full flow
└── requirements.txt      # Python dependencies
```
- Python 3.8+
- Key Libraries: `scikit-learn`, `pandas`, `nltk`, `textblob`
```bash
git clone https://github.com/noecrn/NLP-Sentiment-Analysis-Pipeline.git
cd NLP-Sentiment-Analysis-Pipeline
pip install -r requirements.txt
```
Run the full pipeline (Clean → Vectorize → Train Ensemble → Predict):

```bash
python main.py
```
This script will:
- Load and clean the training data.
- Train the Voting Classifier and print Cross-Validation accuracy.
- Generate predictions for `test.csv` and save them to `outputs/submission.csv`.
- Vectorization: TF-IDF (Unigrams + Bigrams + Trigrams)
- Validation Strategy: 5-Fold Cross-Validation
- Target Metric: Accuracy