This project tackles the Kaggle competition on classifying tweets as disaster or non-disaster using Natural Language Processing (NLP). The solution is built with Python and the deep learning library PyTorch.
Link to the Kaggle competition: https://www.kaggle.com/competitions/nlp-getting-started
- kaggle_nlp/main.py: The main script for model training and prediction. It includes:
  - Data loading and preprocessing
  - Vocabulary building
  - Model training and evaluation (BiLSTM)
  - Saving/loading model and vocabulary
  - Command-line interface for both training and prediction
- kaggle_nlp/utils/utilities.py: Utility functions for data preprocessing, such as:
  - Cleaning and formatting keywords
  - Combining keyword and text fields
  - Removing unnecessary characters and links from tweets
- kaggle_nlp/predict_test.py: (Legacy) Script for making predictions on the test dataset. The main workflow is now in main.py.
- kaggle_nlp/data/: Contains the train.csv and test.csv datasets.
- kaggle_nlp/model/: Stores the trained model weights (bilstm.pt) and vocabulary (vocab.json).
- requirements.txt: Python dependencies for the project.
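
The preprocessing and vocabulary steps listed above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in utils/utilities.py or main.py: the function names (clean_keyword, clean_text, combine_fields, build_vocab, encode), the "<pad>"/"<unk>" tokens, the "%20" keyword handling, and the max_len default are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical patterns: links, and anything that is not a lowercase letter, digit, or whitespace.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_keyword(keyword):
    """Format a raw keyword; missing values (None, NaN) become an empty string."""
    if not isinstance(keyword, str):
        return ""
    # Assumption: spaces in keywords are URL-encoded as "%20".
    return keyword.replace("%20", " ").strip().lower()


def clean_text(text):
    """Lowercase a tweet, drop links and punctuation, and collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    return " ".join(text.split())


def combine_fields(keyword, text):
    """Prepend the cleaned keyword to the cleaned tweet text."""
    return f"{clean_keyword(keyword)} {clean_text(text)}".strip()


def build_vocab(texts, min_freq=1):
    """Map each token to an integer id, reserving 0 for padding and 1 for unknowns."""
    counts = Counter(tok for t in texts for tok in t.split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for token, freq in counts.most_common():
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, max_len=32):
    """Convert text to a fixed-length list of token ids (truncate, then pad)."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.split()][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```

Fixed-length id lists like the ones returned by encode can be stacked into a tensor and fed to an embedding layer in front of the BiLSTM. The vocabulary dictionary is plain JSON-serializable data, which fits the vocab.json file mentioned above.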
- Install Dependencies: Clone the repository and install dependencies (using pip or poetry):

      git clone https://github.com/serverdaun/kaggle_nlp
      cd kaggle_nlp
      pip install -r requirements.txt

- Download Data: Download the competition data and place train.csv and test.csv in the kaggle_nlp/data/ directory.
- Model Training: Run the following command to train the model:

      python -m kaggle_nlp.main train --train_csv kaggle_nlp/data/train.csv --model_dir kaggle_nlp/model --epochs 10

  - Model weights and vocabulary will be saved in kaggle_nlp/model/.
- Predictions: Run the following command to generate predictions on the test set:

      python -m kaggle_nlp.main predict --test_csv kaggle_nlp/data/test.csv --model_dir kaggle_nlp/model

  - The predictions will be saved as predictions.csv in the project root.
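
The train and predict commands above imply a command-line interface with two subcommands. One way such an interface could be wired up with argparse is sketched below. The subcommand and flag names are taken from the commands shown above, while build_parser, main, the default values, and the printed messages are hypothetical placeholders rather than the actual contents of main.py.

```python
import argparse


def build_parser():
    """Build a parser with `train` and `predict` subcommands."""
    parser = argparse.ArgumentParser(
        prog="kaggle_nlp.main",
        description="Train a disaster-tweet classifier or generate predictions.",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    train = subparsers.add_parser("train", help="train the model")
    train.add_argument("--train_csv", required=True, help="path to train.csv")
    # Assumed defaults; the real script may use different ones or none.
    train.add_argument("--model_dir", default="kaggle_nlp/model")
    train.add_argument("--epochs", type=int, default=10)

    predict = subparsers.add_parser("predict", help="predict on the test set")
    predict.add_argument("--test_csv", required=True, help="path to test.csv")
    predict.add_argument("--model_dir", default="kaggle_nlp/model")

    return parser


def main(argv=None):
    """Parse arguments and dispatch; the branches are placeholders only."""
    args = build_parser().parse_args(argv)
    if args.command == "train":
        # A real implementation would load data, train, and save weights here.
        print(f"Training on {args.train_csv} for {args.epochs} epochs")
    else:
        # A real implementation would load the model and write predictions here.
        print(f"Predicting on {args.test_csv} with model from {args.model_dir}")
    return args
```

Calling main(["train", "--train_csv", "kaggle_nlp/data/train.csv", "--epochs", "10"]) exercises the same parsing path as the shell command. Passing an explicit argument list makes the dispatch logic easy to test without a subprocess.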
This project is inspired by Kaggle’s Disaster Tweets competition. It leverages PyTorch for model implementation. Special thanks to the open-source community for providing tools that enable seamless model training and evaluation.