This project is organized into the following sections:
- 1_EDA - Initial Exploration, Data Preprocessing, and Visualizations
- 2_Lexicon-Based - Application of Lexicon-Based models to explore the data
- 3_LogReg_RandForest - Application of ML models Logistic Regression and Random Forest
- 4_BERT - Application of the pre-trained model BERT
- 5_LSTM - Application of an LSTM architecture
- 6_XGBoost - Application of the ML model XGBoost
- Sections 1 & 2: Focus on data exploration and preprocessing.
- Sections 3 to 6: Focus on applying various models for predictions.
This repository contains several Python files, including:
- preprocessing.py and lexicon_based.py: Functions used throughout the project.
- lstm_helper.py and lstm_model.py: Functions used specifically in Section 5 (LSTM).
The initial dataset, airlines_reviews.csv, was preprocessed and exported as processed_data.csv, which serves as the input for the notebooks in Sections 3 to 6. XGBoost_misclssified_smples.csv contains the neutral reviews that XGBoost misclassified.
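
The export step described above could look roughly like the sketch below. It is an illustration only and not the code in preprocessing.py: the function names `clean_text` and `preprocess_csv`, the column name `review`, and the cleaning rules (lowercasing, dropping non-letter characters) are all assumptions.

```python
import csv
import re


def clean_text(text: str) -> str:
    """Lowercase, replace non-letter characters with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


def preprocess_csv(src, dst, text_column="review"):
    """Copy rows from src to dst, cleaning the text column on the way.

    src and dst are open text file objects. The column name "review"
    is a hypothetical placeholder, not taken from the actual dataset.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[text_column] = clean_text(row[text_column])
        writer.writerow(row)
```

Under these assumptions, calling `preprocess_csv` with airlines_reviews.csv opened for reading and processed_data.csv opened for writing (both with `newline=""`) would produce the file consumed by the later notebooks.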
Tasks were divided among group members, but everyone was available to answer questions and resolve issues that arose during the project.