- Daniel Oselu
- John Kanoru
- Roseline Maina
- Angela Cheruto
- Benson Muriu
- Irene Maina
- Nelly Ng'eno
- Janet Gachoki
This project uses Multinomial Naive Bayes to predict the sentiment of tweets towards Apple and Google products, using data from CrowdFlower that is available on data.world.
In a world where technology startups are common, consumer perception of a brand can provide valuable information about consumers' purchasing behavior and, in turn, the financial performance of the business behind the brand. To determine which brands to research further for potential investment, Longview Techventures wants a generalizable model that measures sentiment across various brands.
Longview Techventures is only interested in whether consumers feel positive about a brand, and has hired us to develop a predictive model that tracks recent tweets about tech products so that it can make smart investment choices.
This project primarily uses data gathered from CrowdFlower, which can be found on data.world or in the data folder of this project's GitHub repository. The dataset contains 9203 rows with three columns: the full tweet, the brand or product the tweet is about (if any), and the emotion the tweet expresses towards that brand or product (positive, negative, or none). The products, identified by keyword search, were all either Apple or Google products. The data was gathered in 2013, and all the tweets came from the #SXSW tag.
The dataset has a significant class imbalance: most of the data (61%) is marked as no emotion, and only 6% expresses negative views towards the brand or product.
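As a rough illustration of how such an imbalance can be measured and compensated for, the sketch below computes class shares and "balanced" class weights (`n_samples / (n_classes * class_count)`). The label names and the toy label list are assumptions made for the example: only the 61% and 6% figures come from the dataset description, and the remaining 33% is lumped into a single hypothetical `positive` class.

```python
from collections import Counter


def class_distribution(labels):
    """Return each label's share of the dataset as a fraction in [0, 1]."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / (len(counts) * count) for label, count in counts.items()}


# Toy labels that mimic the skew described above (not the real data).
labels = ["none"] * 61 + ["positive"] * 33 + ["negative"] * 6

print(class_distribution(labels))
print(balanced_class_weights(labels))
```

Rare classes receive larger weights, which is one common way to keep a classifier from simply predicting the majority class.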
Duplicated rows were dropped. Rows with missing values were also dropped, except for missing values in the product column, which were replaced with Undefined.
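The cleaning rules above can be sketched in plain Python as follows. The column names `tweet`, `product`, and `sentiment` are hypothetical stand-ins for the real column names, and the notebook may well perform these steps with pandas instead (for example `drop_duplicates`, `dropna`, and `fillna`).

```python
def clean_rows(rows):
    """Drop duplicates and incomplete rows; fill missing products with 'Undefined'."""
    seen = set()
    cleaned = []
    for row in rows:
        tweet = row.get("tweet")
        sentiment = row.get("sentiment")
        if not tweet or not sentiment:   # missing text or label: drop the row
            continue
        product = row.get("product") or "Undefined"
        key = (tweet, product, sentiment)
        if key in seen:                  # exact duplicate of an earlier row
            continue
        seen.add(key)
        cleaned.append({"tweet": tweet, "product": product, "sentiment": sentiment})
    return cleaned


# Hypothetical rows, for illustration only.
rows = [
    {"tweet": "Love my iPad", "product": "iPad", "sentiment": "positive"},
    {"tweet": "Love my iPad", "product": "iPad", "sentiment": "positive"},
    {"tweet": "At #SXSW", "product": None, "sentiment": "none"},
    {"tweet": None, "product": "Google", "sentiment": "negative"},
]
cleaned = clean_rows(rows)
print(cleaned)
```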
Links, punctuation, and stopwords were removed from the data prior to modeling. The '#' was removed from Twitter hashtags, but the content of the tag was kept. The dataset was small enough that lemmatization was practical for reducing the dimensionality of the data.
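A minimal sketch of this kind of preprocessing, using only the standard library, might look like the following. The stopword list is a tiny illustrative subset, lowercasing is an added assumption, and the URL pattern is a simplification. Lemmatization is left out because it needs an external library (NLTK's `WordNetLemmatizer` is one option).

```python
import re
import string

# Tiny illustrative stopword list; a real run would use a full list.
STOPWORDS = {"a", "an", "the", "is", "at", "to", "of", "and", "in", "for", "my", "on"}

# Simplified link pattern (assumption): http(s) URLs and bare www. links.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(tweet):
    """Lowercase, strip links and punctuation, drop stopwords; keep hashtag text."""
    text = URL_PATTERN.sub(" ", tweet.lower())   # remove links before punctuation
    text = text.translate(PUNCT_TABLE)           # removes '#' too, leaving the tag's word
    return [tok for tok in text.split() if tok not in STOPWORDS]


print(preprocess("Loving the new iPad at #SXSW! http://bit.ly/abc"))
```

Links are removed before punctuation so that a URL is deleted as a whole instead of being broken into stray tokens.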
To get a general idea of both the word frequencies and the product sentiments, several data visualizations were made.
The sentiment column was used as the target variable and the tweet column as the predictor variable. Initial modeling efforts showed that our models suffer from the class imbalance.
Traditional classification algorithms (Naive Bayes, Random Forest, and Logistic Regression) were tested, as well as a more sophisticated LSTM network. Most of the traditional classification models overfit the training data. Multinomial Naive Bayes reduces the overfitting slightly, but its accuracy is limited to 74%.
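To show the mechanics behind Multinomial Naive Bayes, here is a from-scratch sketch with Laplace (add-alpha) smoothing. It is not the notebook's model, which presumably relies on a library implementation such as scikit-learn's `MultinomialNB`. The class name `TinyMultinomialNB` and the toy training tweets are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)   # label -> token -> count
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {tok for counts in self.token_counts.values() for tok in counts}
        return self

    def _log_score(self, tokens, label):
        # log P(label) + sum of log P(token | label) over known tokens
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        total = sum(self.token_counts[label].values()) + self.alpha * len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:   # ignore tokens never seen in training
                score += math.log((self.token_counts[label][tok] + self.alpha) / total)
        return score

    def predict(self, tokens):
        return max(self.class_counts, key=lambda label: self._log_score(tokens, label))


# Toy, already-tokenized tweets (illustrative only, not the real data).
train_docs = [
    ["love", "new", "ipad"],
    ["great", "google", "maps"],
    ["iphone", "battery", "awful"],
    ["hate", "app", "crash"],
]
train_labels = ["positive", "positive", "negative", "negative"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["love", "google"]))
print(model.predict(["awful", "crash"]))
```

The smoothing term keeps a token that never appears in one class from driving that class's probability to zero, which matters with short texts like tweets.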
Testing additional vectorizers and models may improve on these results. Pre-trained word vectors such as Google's Word2Vec, Stanford's GloVe, and those available through spaCy may produce better results.
- Scrape more data to help balance the four classes, especially the negative and positive tweets.
- Build a binary classifier that focuses only on the positive and negative tweets, then add the neutral tweets later on.
- Obtain more tweet data about other tech companies, particularly small startups and PR companies.
Please review our full analysis in our Jupyter Notebook or our presentation.
├── data
├── images
├── README.md
├── Tweets_sentiment_analysis.ipynb
└── Twitter_Sentiment_Analysis_Notebook.pdf