SebastianDeghi/proyect_ml_eng

🎬 IMDB Sentiment Analysis: From NLP to Production

👥 Contributors: Emmanuel Gonzalez Gomez & Dalma Márquez


📝 Overview

This project performs binary sentiment classification (positive/negative) on 50,000 IMDB movie reviews.

It combines:

  • Traditional NLP techniques (TF-IDF)
  • Dense semantic representations (Word2Vec embeddings)
  • Multiple machine learning models
  • Production deployment using FastAPI

The project evolves from a Data Science workflow into a Machine Learning Engineering solution, where the best model is deployed as a REST API with Docker containerization and CI/CD pipelines.


📁 Project Structure

proyect_ml_eng/
├── 📂 .github/workflows/
│   └── ci_cd.yml                          # CI/CD pipeline with GitHub Actions
├── 📂 ml_pipeline/
│   ├── __init__.py
│   ├── train_model.py                     # Model training script
│   ├── predict.py                         # Prediction module
│   ├── config.yaml                        # Configuration file
│   ├── 📂 models/
│   │   ├── model.pkl                      # Saved models (generated)
│   │   └── vectorizer.pkl
│   ├── 📂 api/
│   │   ├── __init__.py
│   │   ├── app.py                         # FastAPI application
│   │   └── Dockerfile                     # Containerization
│   └── 📂 tests/
│       ├── __init__.py
│       └── test_predict.py                # Unit tests (17 tests)
├── 📂 notebooks/
│   └── IMDB_NLP_Sentiment_Analysis.ipynb  # Original EDA and modeling
├── 📂 examples/
│   ├── api_examples.py                    # API usage examples
│   └── notebook_usage.ipynb               # Notebook usage example
├── 📂 scripts/
│   ├── download_dataset.py                # Dataset download utility
│   └── benchmark_model.py                 # Performance benchmarks
├── 📂 monitoring/
│   └── prometheus.yml                     # Prometheus configuration
├── requirements.txt                       # Production dependencies
├── requirements-dev.txt                   # Development dependencies
├── docker-compose.yml                     # Multi-container setup
├── Makefile                               # Common commands
├── pyproject.toml                         # Project configuration
├── .pre-commit-config.yaml                # Pre-commit hooks
├── .gitignore
├── LICENSE
└── README.md

📊 Dataset

The IMDB Dataset of 50K Movie Reviews contains labeled movie reviews.

  • Size: 50,000 reviews
  • Distribution: Balanced (25k positive / 25k negative)
  • Features:
    • review: raw text
    • sentiment: label
  • Task: Binary classification

Source: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
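
As a rough illustration of the two-column schema above, the sketch below parses CSV data with `review` and `sentiment` columns using only the standard library and counts the labels. The inline sample rows and the `label_distribution` helper are made up for this example; on the real file the counts would be expected to come out balanced.

```python
import csv
import io
from collections import Counter

# Tiny inline stand-in for the Kaggle CSV (hypothetical rows, same two columns).
SAMPLE_CSV = '''review,sentiment
"A wonderful, moving film.",positive
"Dull plot and wooden acting.",negative
'''

def label_distribution(csv_file):
    """Count how many reviews carry each sentiment label."""
    reader = csv.DictReader(csv_file)
    return Counter(row["sentiment"] for row in reader)

counts = label_distribution(io.StringIO(SAMPLE_CSV))
print(counts)
```

To run it on the downloaded dataset, pass an open file handle instead of the `io.StringIO` object.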


🚀 Models and Representations

Text Representations

| Representation | Type | Description |
|---|---|---|
| TF-IDF | Sparse | Statistical weighting based on term frequency and inverse document frequency |
| Word2Vec | Dense | Semantic embeddings trained with Skip-gram |
Models Evaluated

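| Representation | Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| TF-IDF | Logistic Regression | 0.8944 | 0.8865 | 0.9046 | 0.8955 |
| TF-IDF | LinearSVC | 0.8920 | 0.8940 | 0.8930 | 0.8937 |
| TF-IDF | Naive Bayes | 0.8670 | 0.8700 | 0.8680 | 0.8688 |
| Word2Vec | MLP | 0.8810 | 0.8830 | 0.8820 | 0.8825 |
| Word2Vec | SVC (RBF) | 0.8800 | 0.8820 | 0.8810 | 0.8810 |
| Word2Vec | Logistic Regression | 0.8710 | 0.8730 | 0.8720 | 0.8720 |
| Word2Vec | Random Forest | 0.8640 | 0.8660 | 0.8650 | 0.8650 |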
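
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The snippet below recomputes it for the TF-IDF + Logistic Regression row; `f1_score` is a small helper written for this example, not a function from the repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for TF-IDF + Logistic Regression.
f1 = f1_score(0.8865, 0.9046)
print(round(f1, 4))  # 0.8955
```

The other rows can be checked the same way. Some of them differ from this formula in the last digit or two, which would be expected if the reported values are averaged over both classes rather than computed for the positive class alone.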

🧠 Key Insights

  • TF-IDF + Logistic Regression provides the best performance (F1: 0.8955)
  • Linear models work best with sparse representations
  • Non-linear models benefit from dense embeddings
  • Word2Vec captures semantic relationships, but its best model here (MLP, F1: 0.8825) is slightly less accurate than TF-IDF + Logistic Regression

⚠️ Known Limitations

  • Does not handle neutral sentiment (binary only)
  • TF-IDF ignores word order and context
  • Vocabulary limited to the training set (words unseen during training are ignored, so a review made up only of unseen words maps to an all-zero vector)
  • Inference time: ~15ms per request on CPU

⚙️ Production API (FastAPI)

The best model (TF-IDF + Logistic Regression) is deployed as a REST API with the following endpoints:

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /predict | POST | Single sentiment prediction |
| /batch | POST | Batch sentiment prediction |
| /docs | GET | Interactive API documentation |
| /info | GET | Model information and metrics |

🔧 Installation

Prerequisites

  • Python 3.8+
  • pip
  • (Optional) Docker Desktop

Clone the repository

git clone https://github.com/SebastianDeghi/proyect_ml_eng.git
cd proyect_ml_eng

Create virtual environment

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python -m venv venv
source venv/bin/activate

Install dependencies

pip install -r requirements.txt

🚀 Usage

1. Train the model

python ml_pipeline/train_model.py

This will:

  • Download the IMDB dataset
  • Preprocess 50,000 reviews
  • Train TF-IDF vectorizer and Logistic Regression
  • Save artifacts to ml_pipeline/models/
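
The last step, saving artifacts, can be sketched with the standard `pickle` module. The file names follow the project tree (`model.pkl`, `vectorizer.pkl`); the `save_artifacts` and `load_artifacts` helpers, the stand-in dict objects, and the temporary directory are assumptions made for illustration, and the real script may well use a different serializer such as joblib.

```python
import pickle
import tempfile
from pathlib import Path

def save_artifacts(model, vectorizer, models_dir):
    """Pickle the model and vectorizer into models_dir, creating it if needed."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    for name, obj in (("model.pkl", model), ("vectorizer.pkl", vectorizer)):
        with open(models_dir / name, "wb") as fh:
            pickle.dump(obj, fh)

def load_artifacts(models_dir):
    """Load the (model, vectorizer) pair written by save_artifacts."""
    models_dir = Path(models_dir)
    with open(models_dir / "model.pkl", "rb") as fh:
        model = pickle.load(fh)
    with open(models_dir / "vectorizer.pkl", "rb") as fh:
        vectorizer = pickle.load(fh)
    return model, vectorizer

# Stand-in objects; in the real pipeline these would be fitted estimators.
with tempfile.TemporaryDirectory() as tmp:
    save_artifacts({"kind": "model"}, {"kind": "vectorizer"}, Path(tmp) / "models")
    model, vectorizer = load_artifacts(Path(tmp) / "models")
print(model, vectorizer)
```

Pickle files can execute arbitrary code when loaded, so only load artifacts that were produced by a trusted training run.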

2. Run tests

pytest ml_pipeline/tests/ -v

Expected output: 17 passed

3. Test predictions

python -c "from ml_pipeline.predict import predict_sentiment, load_model_and_vectorizer; m, v = load_model_and_vectorizer(); print('Positive:', predict_sentiment('Great movie!', m, v)['sentiment']); print('Negative:', predict_sentiment('Awful film', m, v)['sentiment'])"

4. Run API locally

cd ml_pipeline/api
uvicorn app:app --reload

Open in your browser:

👉 http://127.0.0.1:8000/docs

5. Test the API (with PowerShell)

# Positive review
$response = Invoke-WebRequest -Uri http://localhost:8000/predict -Method POST -Body '{"text":"This movie is absolutely amazing!"}' -ContentType "application/json"
$response.Content | ConvertFrom-Json

# Negative review
$response = Invoke-WebRequest -Uri http://localhost:8000/predict -Method POST -Body '{"text":"Terrible film, waste of time."}' -ContentType "application/json"
$response.Content | ConvertFrom-Json
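
On shells other than PowerShell, the same request could be sent from Python with the standard library. This is a sketch under the assumption that the API is listening on `http://localhost:8000`; `build_predict_request` and `predict` are hypothetical helper names, not part of `examples/api_examples.py`.

```python
import json
import urllib.request

def build_predict_request(text, base_url="http://localhost:8000"):
    """Build (but do not send) a JSON POST request for /predict."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(text, base_url="http://localhost:8000"):
    """Send the request and decode the JSON response (needs a running API)."""
    with urllib.request.urlopen(build_predict_request(text, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = build_predict_request("This movie is absolutely amazing!")
print(req.get_method(), req.full_url)
```

With the server running, `predict("Terrible film, waste of time.")` would return the decoded response as a dictionary.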

6. Run API examples

python examples/api_examples.py

🐳 Docker

Build the image

docker build -t imdb-api -f ml_pipeline/api/Dockerfile .

Run the container

docker run -p 8000:8000 imdb-api

Using Docker Compose

docker-compose up --build

🔮 API Endpoints

POST /predict

Request

{
  "text": "This movie was absolutely amazing!"
}

Response

{
  "sentiment": "positive",
  "confidence": 0.9876,
  "text_length": 34
}
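
How `confidence` is derived is not spelled out here. One plausible reading, sketched below purely as an assumption, is that it is the probability a logistic regression model assigns to the predicted class, and that `text_length` is the character count of the input. The `to_response` helper, the raw scores passed to it, and the 0.5 decision threshold are all illustrative.

```python
import math

def sigmoid(z):
    """Map a raw decision score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_response(text, decision_score):
    """Build a /predict-style payload from a raw score (hypothetical helper)."""
    p_positive = sigmoid(decision_score)
    sentiment = "positive" if p_positive >= 0.5 else "negative"
    # Confidence is the probability assigned to the predicted class.
    confidence = p_positive if sentiment == "positive" else 1.0 - p_positive
    return {
        "sentiment": sentiment,
        "confidence": round(confidence, 4),
        "text_length": len(text),
    }

result = to_response("Great movie!", 2.94)
print(result)
```

Under this reading the confidence can never fall below 0.5, so values close to 0.5 signal reviews the model cannot clearly place in either class.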

POST /batch

Request

{
  "texts": [
    "Great movie!",
    "Awful film.",
    "It was okay."
  ]
}

Response

{
  "results": [
    {"sentiment": "positive", "confidence": 0.95, "text_length": 12},
    {"sentiment": "negative", "confidence": 0.92, "text_length": 11},
    {"sentiment": "negative", "confidence": 0.65, "text_length": 12}
  ],
  "total_count": 3
}
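
A client might want to tally a batch response and flag borderline predictions such as "It was okay.", since the model has no neutral class. The sketch below works on a trimmed copy of the response above (`text_length` omitted for brevity); the `summarize` helper and the 0.7 threshold are assumptions chosen for the example.

```python
import json
from collections import Counter

# Trimmed copy of the /batch response shown above.
RAW = '''{
  "results": [
    {"sentiment": "positive", "confidence": 0.95},
    {"sentiment": "negative", "confidence": 0.92},
    {"sentiment": "negative", "confidence": 0.65}
  ],
  "total_count": 3
}'''

def summarize(batch, threshold=0.7):
    """Return label counts and the indices of low-confidence predictions."""
    results = batch["results"]
    if batch["total_count"] != len(results):
        raise ValueError("total_count does not match the number of results")
    counts = Counter(r["sentiment"] for r in results)
    uncertain = [i for i, r in enumerate(results) if r["confidence"] < threshold]
    return counts, uncertain

counts, uncertain = summarize(json.loads(RAW))
print(counts, uncertain)
```

Here only the third review falls below the threshold, so a caller could treat it as "unsure" instead of trusting the negative label.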

GET /health

Response

{
  "status": "healthy",
  "model_loaded": true,
  "version": "1.0.0"
}
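
When the API starts inside a container, a client may need to wait until the model is loaded before sending requests. The sketch below polls a caller-supplied `fetch_health` function, a hypothetical callable that performs `GET /health` and returns the decoded JSON. The `"starting"` status in the demo is invented for illustration; only the `"healthy"` payload comes from the response above.

```python
import time

def wait_until_healthy(fetch_health, attempts=10, delay=0.5):
    """Poll fetch_health() until the API reports a loaded model."""
    for _ in range(attempts):
        try:
            payload = fetch_health()
            if payload.get("status") == "healthy" and payload.get("model_loaded"):
                return True
        except OSError:
            pass  # server is not accepting connections yet
        time.sleep(delay)
    return False

# Demo with canned payloads instead of a real HTTP call.
responses = iter([
    {"status": "starting", "model_loaded": False},
    {"status": "healthy", "model_loaded": True, "version": "1.0.0"},
])
ready = wait_until_healthy(lambda: next(responses), delay=0)
print(ready)
```

Injecting the fetch function keeps the polling logic testable without a running server; connection errors raised by `urllib` are subclasses of `OSError`, so they are retried as well.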

🧪 Inference Pipeline

Raw Text → Preprocessing → TF-IDF Vectorization → Logistic Regression → Sentiment Prediction

Preprocessing steps:

  1. Lowercase conversion
  2. Remove non-alphabetic characters
  3. Tokenization
  4. Stopwords removal
  5. Lemmatization
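
The five steps above can be sketched with the standard library alone. This is a simplified stand-in rather than the project's implementation: the stopword set is a tiny illustrative sample, and `naive_lemma` is a crude plural stripper used in place of a real lemmatizer (such as the one in nltk, which the project lists among its tools).

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a full one.
STOPWORDS = {"a", "an", "and", "the", "is", "was", "were", "are",
             "it", "this", "of", "to", "in"}

def naive_lemma(token):
    """Crude plural stripper standing in for a real lemmatizer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def preprocess(text):
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                # 2. drop non-alphabetic characters
    tokens = text.split()                                # 3. whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. stopword removal
    return [naive_lemma(t) for t in tokens]              # 5. toy lemmatization

tokens = preprocess("This movie was AMAZING!!! 10/10, the actors were great.")
print(tokens)  # ['movie', 'amazing', 'actor', 'great']
```

Whatever the exact implementation, the same function has to be applied at training and at inference time; otherwise the tokens will not match the vectorizer's vocabulary.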

🗺️ Roadmap

  • TF-IDF + Logistic Regression baseline
  • FastAPI deployment
  • Docker containerization
  • Unit tests (17 tests passing)
  • Batch prediction endpoint
  • MLflow experiment tracking
  • Prometheus monitoring
  • GitHub Actions CI/CD
  • Cloud deployment (Render/AWS)

📊 Performance Metrics

| Metric | Value |
|---|---|
| Accuracy | 89.44% |
| Precision | 88.65% |
| Recall | 90.46% |
| F1-score | 89.55% |
| Inference Time | ~15ms per request |
| Training Time | ~45 seconds |
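
Latency figures like the one above depend heavily on hardware and input length, so it helps to measure them locally. The sketch below times any callable with `time.perf_counter`; `measure_latency_ms` and the `fake_predict` workload are placeholders for this example and are not taken from `scripts/benchmark_model.py`.

```python
import statistics
import time

def measure_latency_ms(fn, payload, runs=100):
    """Return (median, max) wall-clock latency of fn(payload) in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), max(samples)

# Stand-in workload; swap in a call to the real prediction function.
def fake_predict(text):
    return len(text.split())

median_ms, worst_ms = measure_latency_ms(fake_predict, "Great movie!")
print(f"median={median_ms:.3f} ms  max={worst_ms:.3f} ms")
```

Reporting the median alongside the maximum makes occasional slow calls (cold caches, garbage collection) visible without letting them dominate the headline number.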

🙏 Acknowledgements

  • Lakshmi N Pathi for making the dataset publicly available on Kaggle
  • The open-source community for essential tools:
    • scikit-learn
    • nltk
    • gensim
    • pandas
    • numpy
    • fastapi
    • uvicorn

📄 License

MIT License - see the LICENSE file for details.


👤 Author

Sebastián Deghi

About

Binary sentiment classification on IMDB 50K reviews. Compares TF-IDF (sparse) vs Word2Vec (dense) representations. Models: Logistic Regression, SVM, Random Forest, MLP, Naive Bayes. Includes data preprocessing, hyperparameter tuning (GridSearchCV), and model evaluation metrics.
