This project performs binary sentiment classification (positive/negative) on 50,000 IMDB movie reviews.
It combines:
- Traditional NLP techniques (TF-IDF)
- Dense semantic representations (Word2Vec embeddings)
- Multiple machine learning models
- Production deployment using FastAPI
The project evolves from a Data Science workflow into a Machine Learning Engineering solution, where the best model is deployed as a REST API with Docker containerization and CI/CD pipelines.
proyect_ml_eng/
├── .github/workflows/
│   └── ci_cd.yml  # CI/CD pipeline with GitHub Actions
├── ml_pipeline/
│   ├── __init__.py
│   ├── train_model.py  # Model training script
│   ├── predict.py  # Prediction module
│   ├── config.yaml  # Configuration file
│   ├── models/
│   │   ├── model.pkl  # Saved models (generated)
│   │   └── vectorizer.pkl
│   ├── api/
│   │   ├── __init__.py
│   │   ├── app.py  # FastAPI application
│   │   └── Dockerfile  # Containerization
│   └── tests/
│       ├── __init__.py
│       └── test_predict.py  # Unit tests (17 tests)
├── notebooks/
│   └── IMDB_NLP_Sentiment_Analysis.ipynb  # Original EDA and modeling
├── examples/
│   ├── api_examples.py  # API usage examples
│   └── notebook_usage.ipynb  # Notebook usage example
├── scripts/
│   ├── download_dataset.py  # Dataset download utility
│   └── benchmark_model.py  # Performance benchmarks
├── monitoring/
│   └── prometheus.yml  # Prometheus configuration
├── requirements.txt  # Production dependencies
├── requirements-dev.txt  # Development dependencies
├── docker-compose.yml  # Multi-container setup
├── Makefile  # Common commands
├── pyproject.toml  # Project configuration
├── .pre-commit-config.yaml  # Pre-commit hooks
├── .gitignore
├── LICENSE
└── README.md
The IMDB Dataset of 50K Movie Reviews contains labeled movie reviews.
- Size: 50,000 reviews
- Distribution: Balanced (25k positive / 25k negative)
- Features:
  - `review`: raw text
  - `sentiment`: label
- Task: Binary classification
Source: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
| Representation | Type | Description |
|---|---|---|
| TF-IDF | Sparse | Statistical weighting based on term frequency and inverse document frequency |
| Word2Vec | Dense | Semantic embeddings trained with Skip-gram |
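To make the sparse side of this comparison concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It uses the textbook formula `tf * log(N / df)`, which differs from scikit-learn's smoothed and normalized variant. The toy corpus and the function name `tfidf` are illustrative assumptions, not code from this repository.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (textbook tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (illustrative only, not taken from the IMDB data).
corpus = ["great movie great cast", "awful movie", "boring movie bad cast"]
vectors = tfidf(corpus)
```

Each dictionary stores only the terms present in its document, which is what makes the representation sparse. A term that appears in every document ("movie" here) receives a weight of zero.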
| Representation | Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| TF-IDF | Logistic Regression | 0.8944 | 0.8865 | 0.9046 | 0.8955 |
| TF-IDF | LinearSVC | 0.8920 | 0.8940 | 0.8930 | 0.8937 |
| TF-IDF | Naive Bayes | 0.8670 | 0.8700 | 0.8680 | 0.8688 |
| Word2Vec | MLP | 0.8810 | 0.8830 | 0.8820 | 0.8825 |
| Word2Vec | SVC (RBF) | 0.8800 | 0.8820 | 0.8810 | 0.8810 |
| Word2Vec | Logistic Regression | 0.8710 | 0.8730 | 0.8720 | 0.8720 |
| Word2Vec | Random Forest | 0.8640 | 0.8660 | 0.8650 | 0.8650 |
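The F1-score column is the harmonic mean of precision and recall. As a quick check, the short sketch below recomputes it for the TF-IDF + Logistic Regression row. The helper name `f1_score` is defined here for illustration and is not imported from any library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for TF-IDF + Logistic Regression.
best_f1 = f1_score(0.8865, 0.9046)
print(round(best_f1, 4))  # 0.8955
```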
- TF-IDF + Logistic Regression provides the best performance (F1: 0.8955)
- Linear models work best with sparse representations
- Non-linear models benefit from dense embeddings
- Word2Vec captures semantic relationships better but is slightly less accurate here
- Does not handle neutral sentiment (binary only)
- TF-IDF ignores word order and context
- Vocabulary limited to training set (new words become zero vectors)
- Inference time: ~15ms per request on CPU
The best model (TF-IDF + Logistic Regression) is deployed as a REST API with the following endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/predict` | POST | Single sentiment prediction |
| `/batch` | POST | Batch sentiment prediction |
| `/docs` | GET | Interactive API documentation |
| `/info` | GET | Model information and metrics |
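As a rough illustration of calling these endpoints from Python, the sketch below builds a `POST /predict` request using only the standard library. It assumes the API is running locally on port 8000 and that `/predict` accepts a JSON body with a `text` field, as in the request examples in this README. The helper names `build_predict_request` and `predict` are hypothetical and not part of the repository.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local uvicorn default port

def build_predict_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /predict request with a JSON body of the form {"text": ...}."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(text: str, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response (requires a running API)."""
    with urllib.request.urlopen(build_predict_request(text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example call, once the server is up:
#   predict("This movie is absolutely amazing!")
```

Separating request construction from sending keeps the payload easy to inspect without a live server.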
- Python 3.8+
- pip
- (Optional) Docker Desktop
git clone https://github.com/SebastianDeghi/proyect_ml_eng.git
cd proyect_ml_eng
# Windows
python -m venv venv
venv\Scripts\activate
# macOS/Linux
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python ml_pipeline/train_model.py
This will:
- Download the IMDB dataset
- Preprocess 50,000 reviews
- Train TF-IDF vectorizer and Logistic Regression
- Save artifacts to `ml_pipeline/models/`
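The fit-and-save part of these steps can be sketched roughly as follows. This is not the contents of `train_model.py`: the function name `train_and_save`, the default `TfidfVectorizer` settings, `max_iter=1000`, and the use of `pickle` are assumptions made for illustration. Only the artifact file names come from the project layout.

```python
import pickle
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_and_save(texts, labels, out_dir):
    """Fit a TF-IDF vectorizer and Logistic Regression, then pickle both."""
    vectorizer = TfidfVectorizer()  # assumption: default settings
    features = vectorizer.fit_transform(texts)
    model = LogisticRegression(max_iter=1000)  # assumption: hyperparameters unknown
    model.fit(features, labels)

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    with open(out_path / "model.pkl", "wb") as fh:
        pickle.dump(model, fh)
    with open(out_path / "vectorizer.pkl", "wb") as fh:
        pickle.dump(vectorizer, fh)
    return model, vectorizer

# Tiny stand-in data; the real script trains on the 50,000 IMDB reviews.
texts = ["great movie", "wonderful acting", "awful film", "terrible plot"]
labels = ["positive", "positive", "negative", "negative"]
```

Calling `train_and_save(texts, labels, "ml_pipeline/models")` would write both pickle files. The vectorizer has to be saved alongside the model so that inference uses the same vocabulary and weights.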
pytest ml_pipeline/tests/ -v
Expected output: 17 passed
python -c "from ml_pipeline.predict import predict_sentiment, load_model_and_vectorizer; m, v = load_model_and_vectorizer(); print('Positive:', predict_sentiment('Great movie!', m, v)['sentiment']); print('Negative:', predict_sentiment('Awful film', m, v)['sentiment'])"
cd ml_pipeline/api
uvicorn app:app --reload
Open in your browser:
http://127.0.0.1:8000/docs
# Positive review
$response = Invoke-WebRequest -Uri http://localhost:8000/predict -Method POST -Body '{"text":"This movie is absolutely amazing!"}' -ContentType "application/json"
$response.Content | ConvertFrom-Json
# Negative review
$response = Invoke-WebRequest -Uri http://localhost:8000/predict -Method POST -Body '{"text":"Terrible film, waste of time."}' -ContentType "application/json"
$response.Content | ConvertFrom-Json
python examples/api_examples.py
docker build -t imdb-api -f ml_pipeline/api/Dockerfile .
docker run -p 8000:8000 imdb-api
docker-compose up --build

POST /predict
{
  "text": "This movie was absolutely amazing!"
}

{
  "sentiment": "positive",
  "confidence": 0.9876,
  "text_length": 42
}

POST /batch
{
  "texts": [
    "Great movie!",
    "Awful film.",
    "It was okay."
  ]
}

{
  "results": [
    {"sentiment": "positive", "confidence": 0.95, "text_length": 12},
    {"sentiment": "negative", "confidence": 0.92, "text_length": 11},
    {"sentiment": "negative", "confidence": 0.65, "text_length": 13}
  ],
  "total_count": 3
}

GET /health
{
  "status": "healthy",
  "model_loaded": true,
  "version": "1.0.0"
}

Raw Text → Preprocessing → TF-IDF Vectorization → Logistic Regression → Sentiment Prediction
Preprocessing steps:
- Lowercase conversion
- Remove non-alphabetic characters
- Tokenization
- Stopwords removal
- Lemmatization
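The preprocessing steps listed above can be sketched in plain Python as shown below. This is a simplified illustration, not the repository's implementation: the stopword set is a tiny hand-picked sample, tokenization is a plain whitespace split, and `lemmatize` is a placeholder that returns the token unchanged where a real lemmatizer (for example from NLTK) would go.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a full list.
STOPWORDS = {"a", "an", "the", "this", "is", "was", "it", "and", "of", "to"}

def lemmatize(token: str) -> str:
    """Placeholder: returns the token unchanged. Swap in a real lemmatizer here."""
    return token

def preprocess(text: str) -> list:
    text = text.lower()                                  # 1. lowercase conversion
    text = re.sub(r"[^a-z\s]", " ", text)                # 2. remove non-alphabetic characters
    tokens = text.split()                                # 3. whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. stopword removal
    return [lemmatize(t) for t in tokens]                # 5. lemmatization (placeholder)

print(preprocess("This movie was GREAT!!! 10/10"))  # ['movie', 'great']
```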
- TF-IDF + Logistic Regression baseline
- FastAPI deployment
- Docker containerization
- Unit tests (17 tests passing)
- Batch prediction endpoint
- MLflow experiment tracking
- Prometheus monitoring
- GitHub Actions CI/CD
- Cloud deployment (Render/AWS)
| Metric | Value |
|---|---|
| Accuracy | 89.44% |
| Precision | 88.65% |
| Recall | 90.46% |
| F1-score | 89.55% |
| Inference Time | ~15ms per request |
| Training Time | ~45 seconds |
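A per-request latency figure like the one in this table can be estimated with a simple timing loop. The sketch below is a generic illustration and not the contents of `scripts/benchmark_model.py`. The function `measure_latency_ms` and the stand-in `dummy_predict` are hypothetical, and real numbers depend on hardware and on the trained artifacts.

```python
import statistics
import time

def measure_latency_ms(predict_fn, text, runs=100):
    """Time repeated calls to predict_fn(text); return mean and median in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings),
        "median_ms": statistics.median(timings),
    }

# Stand-in predictor so the sketch runs without trained artifacts.
def dummy_predict(text):
    return {"sentiment": "positive" if "great" in text.lower() else "negative"}

stats = measure_latency_ms(dummy_predict, "Great movie!", runs=50)
```

Reporting the median next to the mean makes the result less sensitive to occasional slow calls, such as the first request after startup.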
- https://aclanthology.org/P11-1015.pdf
- https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
- Lakshmi N Pathi for making the dataset publicly available on Kaggle
- The open-source community for essential tools: scikit-learn, nltk, gensim, pandas, numpy, fastapi, uvicorn
MIT License - see the LICENSE file for details.
Sebastián Deghi
- GitHub: https://github.com/SebastianDeghi
- LinkedIn: https://www.linkedin.com/in/sebastian-deghi/
- Google Scholar: https://scholar.google.com/citations?user=3Nq5hTIAAAAJ&hl=en