Final project submission for the Natural Language Processing course at the UC Berkeley School of Information, and grand-prize-winning submission for the 2016 Wells Fargo Campus Analytics Challenge.
The challenge provided a dataset of social media messages about four major banks and asked the question: What are banking customers saying on social media?
Our approach to the problem was a four-step process:
- Clean the data by removing irrelevant messages and common words introduced by data preprocessing (e.g. NAME, ADDRESS)
- Identify main topics being discussed with bigram collocations
- Cross-tabulate messages by topic and bank
- Use Latent Dirichlet Allocation (LDA) clustering to further separate the messages and identify their substance
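
The first three steps can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from the actual submission: the sample messages, the bank labels (`BankA`, `BankB`), the stopword list, the use of pointwise mutual information (PMI) as the collocation score, and the helper names `tokenize`, `top_bigrams`, and `crosstab` are all assumptions made for the example. The LDA step is left out because it would normally rely on a topic-modeling library.

```python
import math
import re
from collections import Counter

# Hypothetical sample data: (bank, message) pairs invented for illustration.
MESSAGES = [
    ("BankA", "NAME my credit card was declined again"),
    ("BankA", "credit card fees are too high ADDRESS"),
    ("BankB", "great customer service today"),
    ("BankB", "customer service never answers NAME"),
    ("BankA", "customer service fixed my credit card"),
]

# Assumed placeholder words left behind by preprocessing (step 1).
PLACEHOLDERS = {"name", "address"}
# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"my", "was", "are", "too", "the", "a"}


def tokenize(text):
    """Lowercase, keep alphabetic words, drop placeholders and stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in PLACEHOLDERS and w not in STOPWORDS]


def top_bigrams(token_lists, min_count=2):
    """Rank adjacent word pairs by PMI, ignoring pairs seen fewer than min_count times (step 2)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scored.append(((w1, w2), math.log2(p_pair / p_independent)))
    # Highest PMI first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


def crosstab(messages, topics):
    """Count messages per (topic, bank); a message matches a topic if it contains that bigram (step 3)."""
    table = Counter()
    for bank, text in messages:
        tokens = tokenize(text)
        pairs = set(zip(tokens, tokens[1:]))
        for topic in topics:
            if topic in pairs:
                table[(" ".join(topic), bank)] += 1
    return table


token_lists = [tokenize(text) for _, text in MESSAGES]
topics = [pair for pair, _ in top_bigrams(token_lists)]
table = crosstab(MESSAGES, topics)

for (topic, bank), count in sorted(table.items()):
    print(f"{topic:20s} {bank:6s} {count}")
```

On this toy data the frequent bigrams act as topic labels, and the resulting table shows how often each topic appears for each bank. A message can count toward more than one topic when it contains several of the bigrams.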
Submitted by Paul Glenn and Vijay Velagapudi