Short-Text-Classification

Short-form text classification for Twitter data.

This is the repository for the final project for the Master's level Machine Learning course at The University of Chicago.

Project scope

Build a binary classifier that categorizes a given tweet based on whether or not it refers to a real-life disaster event.

Project motivation

The field of NLP has evolved rapidly in the past few years, leading to innovations across the analysis pipeline. Through this project, we aim to examine how much these new techniques improve results when applied to a classic NLP dataset, disaster tweets.

Analysis framework

  1. Data Source

Kaggle Disaster Tweet Classification

  2. Data Exploration

We used LDA topic modeling to cluster and identify key features of the disaster vs. non-disaster tweets.

  3. Embedding

We explore the following embedding techniques:

1. TF-IDF
2. GloVe
3. BERT sentence-level embedding
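
As an illustration of the first technique in the list above, here is a minimal standard-library sketch of TF-IDF weighting. It is not the project's code: the function names, the whitespace tokenizer, the smoothed IDF formula, and the two example tweets are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split. Real tweets would need
    # more cleaning (URLs, mentions, hashtags, punctuation).
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        # Term frequency (relative count) times IDF, one entry per vocab term.
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Made-up example tweets, not taken from the Kaggle dataset.
example_tweets = ["fire in the forest", "the forest is calm"]
vocab, vectors = tfidf_vectors(example_tweets)
```

Terms that appear in fewer documents (here "fire") receive a larger weight than terms shared by every document (here "the"), which is the property that makes TF-IDF useful as a sparse baseline representation.
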

  4. Modeling

We tested the following modeling options:

1. Naive Bayes (baseline)
2. SVM with linear and non-linear kernels (baseline)
3. CNN
4. Simple RNN
5. RNN with LSTM
6. BERT with a max pooling layer
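
To make the first baseline in the list above concrete, the following is a small standard-library sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. It is an assumption-laden illustration rather than the project's implementation: the class name `NaiveBayesSketch`, the whitespace tokenizer, and the four training tweets are all hypothetical.

```python
import math
from collections import Counter


class NaiveBayesSketch:
    """Multinomial Naive Bayes over whitespace tokens (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            denom = total_words + self.alpha * len(self.vocab)
            # Start from the log prior, then add per-word log likelihoods.
            score = math.log(self.class_counts[c] / n_docs)
            for w in text.lower().split():
                if w not in self.vocab:
                    continue  # ignore words never seen in training
                score += math.log((self.word_counts[c][w] + self.alpha) / denom)
            scores[c] = score
        return max(scores, key=scores.get)


# Made-up training tweets: label 1 = real disaster, 0 = not a disaster.
train_texts = [
    "forest fire spreading near town",
    "earthquake destroys several homes",
    "this new album is fire",
    "my exam was a disaster lol",
]
train_labels = [1, 1, 0, 0]

model = NaiveBayesSketch().fit(train_texts, train_labels)
pred_disaster = model.predict("earthquake near town")
pred_other = model.predict("new album lol")
```

A bag-of-words baseline like this ignores word order and context, which is one reason figurative uses such as "this album is fire" are hard for it and why the sequence models further down the list are worth comparing against.
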

  5. Fine-Tuning

For the CNN and RNN architectures, we used the following strategies to search for the best hyperparameters:

1. Random Search
2. Hyperband
3. Bayesian optimization
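
The simplest of the three strategies above, random search, can be sketched in a few lines of standard-library Python. Everything specific here is hypothetical: the search space, the parameter names, and `toy_validation_score`, which stands in for "train the network with this configuration and return its validation score". The repository does not state which values were actually searched.

```python
import math
import random

# Hypothetical search space; the real ranges are not given in this README.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "hidden_units": [32, 64, 128],
    "dropout": [0.1, 0.3, 0.5],
}


def sample_config(rng):
    # Draw each hyperparameter independently and uniformly from its options.
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def toy_validation_score(config):
    # Stand-in objective (higher is better, maximum 0). A real run would
    # train a CNN or RNN with `config` and return validation accuracy.
    lr_penalty = abs(math.log10(config["learning_rate"]) - math.log10(1e-3))
    units_penalty = abs(config["hidden_units"] - 64) / 64
    dropout_penalty = abs(config["dropout"] - 0.3)
    return -(lr_penalty + units_penalty + dropout_penalty)


def random_search(score_fn, n_trials, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(toy_validation_score, n_trials=20)
```

Random search treats every trial independently and gives each one the full training budget. Hyperband instead allocates budget adaptively and stops weak configurations early, while Bayesian optimization uses the results of earlier trials to choose the next configuration to try.
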

  6. Conclusion

Overall, we found that the choice of embedding provided the largest performance boost at the lowest computational cost.

Project members

Ada Jing: GitHub; LinkedIn

Dylan Zhang: GitHub; LinkedIn

Rohit Satishchandra: LinkedIn
