Course code : CMPE-257
Group name : Codebusters
Name: Harini Balakrishnan (010830755)
GitHub URL: https://github.com/HariniGB/AlternusVera
The fake news problem is too important to ignore, especially after the recent election of Donald Trump. Fake news is like a malignant tumor that poses a moral threat; it is nothing short of an assault on truth. Treating such stories as impartially as real news amounts to endorsing deliberate lies.
The main purpose of this project is to identify essential features that can be trusted to predict whether a news article is fake. These features are related to the news content, and our key classification task is to predict whether an article is fake based on them. In addition, we intend to learn various deep learning and neural network techniques and compare their performance.
As a team, we initially performed a literature survey on the features that have so far played a major role in popularizing fake news. Each member of the team took one major feature and performed the distillation process. We vectorized the text, produced embedding vectors, computed a polynomial equation, and classified the news articles based on the features below.
- It has 3 files: test, training, and valid.
- Each file has 14 columns:
Column 1: the ID of the statement ([ID].json).
Column 2: the label.
Column 3: the statement.
Column 4: the subject(s).
Column 5: the speaker.
Column 6: the speaker's job title.
Column 7: the state info.
Column 8: the party affiliation.
Columns 9-13: the total credit history counts, including the current statement.
Column 14: the context (venue / location of the speech or statement).
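The dataset splits can be loaded with pandas as sketched below. The column names here are hypothetical labels inferred from the description above (the files themselves have no header row), and the tab separator and `.tsv` filenames are assumptions.

```python
import pandas as pd

# Hypothetical column labels based on the 14-column description above;
# columns 9-13 are the credit-history counts for each truthfulness bucket.
COLUMNS = ["id", "label", "statement", "subject", "speaker", "job_title",
           "state", "party", "barely_true_ct", "false_ct", "half_true_ct",
           "mostly_true_ct", "pants_on_fire_ct", "context"]

def load_split(path):
    """Load one tab-separated split (train/test/valid) of the dataset."""
    return pd.read_csv(path, sep="\t", header=None, names=COLUMNS)

# train = load_split("train.tsv")  # assumed filename
```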
- Load the Data
- Distillation Process
- Data Cleaning and Text Preprocessing
- Visualization
- Feature 1 : Sentiment Analysis using Vader
- Feature 2 : LDA Topic Modeling
- Feature 3 : Sensationalism Analysis Cosine Similarity
- Counter Vectorization and classification models
- TF-IDF Vectorization and classification models
- Word2Vec word embedding and TSNE Visualization
- Doc2Vec Tagging and classification models
- Compare Counter vs TF-IDF vs Doc2Vec
- Vector Classification Modeling
- Ranking and Importance
Classification algorithms:
- Naive Bayes
- Logistic Regression
- SVM with Stochastic Gradient Descent
- Linear SVM Classifier
- Random Forest Classifier
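A sketch of fitting these five scikit-learn classifiers over a shared vectorizer; the toy statements and labels are hypothetical stand-ins for the real dataset, and hyperparameters are left at defaults except where noted.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical labeled statements: 1 = fake, 0 = real
texts = ["shocking secret they hide", "senate passes budget bill",
         "miracle cure doctors hate", "governor signs new law"] * 10
labels = [1, 0, 1, 0] * 10

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (SGD)": SGDClassifier(random_state=42),  # hinge loss = linear SVM via SGD
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

accuracies = {}
for name, clf in models.items():
    pipe = make_pipeline(CountVectorizer(), clf)
    pipe.fit(texts, labels)
    accuracies[name] = pipe.score(texts, labels)
```

Swapping `CountVectorizer()` for `TfidfVectorizer()` reproduces the TF-IDF comparison reported later.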
Top Features Selected based on research articles
- Political Affiliation
- Sensationalism
- Click bait
- Context Modeling
- Spam
Other simple features as a part of distillation:
- Sentiment Analysis
- LDA Topic Modeling
- Ranking
| Features | Member |
|---|---|
| Sensationalism | Harini Balakrishnan |
| Political Affiliation | Anushri Srinath Aithal |
| Context Modeling | Sunder Thyagarajan |
| Clickbait | Ravi Katta |
| Spam | All Members |
| Versions | Features | DataEnrichment & Corpus | Explanation |
|---|---|---|---|
| Part 1 | Sentiment Analysis | VADER Sentiment Intensity Analyzer, SenticNet5 | Tokenization, Normalization, Stemming, CountVectorization, TF-IDF Vectorization, Doc2Vec and classification models |
| Part 2 | Distillation | GoogleNews-vectors-negative300 | Stop words, Lemmatization, Spell Check, LDA Topic Modeling, word2vec, lda2vec |
| Part 3 | Sensationalism | The Persuasion Revolution Website | Doc2Vec, word2vec, TF-IDF Vectorization, Cosine Similarity, Compared three vectorization models |
- SenticNet5 for sentiment and sensationalism corpus
- Google News 3-million-word corpus for spell check
- NLTK
- Gensim
- Numpy
- Pandas
- CSV
- WordCloud
- Seaborn
- Scipy
- Regular Expression
- Matplotlib
- Sklearn
Initially I preprocessed the given dataset using NLTK's built-in libraries for tokenization, stop-word removal, stemming, and lemmatization. Then I visualized the cleaned data using WordCloud. I extracted compound features such as Sentiment, Sensationalism, and LDA Topic score, and used them to classify each news document as fake or not. I tried three methods. First I tried CountVectorizer, with which I got 95% accuracy for sentiment and 63.45% for sensationalism. Next I tried TfidfVectorizer, which gave me 97% accuracy for sentiment and 61% for sensationalism. Finally I tried Doc2Vec, which gave 91% accuracy for sentiment and 50% for sensationalism.
Since I was taking the sentiment intensity of each word and aggregating the words into a single vector per document, I also tried to get a vector for each word using the Word2Vec library. There are two architecture options in Word2Vec: skip-gram (default) and CBOW (continuous bag of words). Most of the time, skip-gram is a little slower but more accurate than CBOW. CBOW predicts one word from its surrounding text, so a small dataset favors it. Skip-gram is the opposite: given a target word, it predicts the words around it, and the more data it has, the better it performs. There are also two training algorithms for Word2Vec: hierarchical softmax (default) and negative sampling; I used the default. Word2Vec provides a vector for each word, which caused dimensionality issues with the 'senti_word_vector' column of each document. I performed vector averaging, which gave me a single vector for each document, but the vector lengths still varied across documents.
I couldn't run a classification model because the vector dimension varied across documents. My solution was to try Doc2Vec, which provides a fixed-size vector for the entire document instead of one vector per word in the corpus. In the word2vec architecture, the two algorithms are "continuous bag of words" (CBOW) and "skip-gram" (SG); in the doc2vec architecture, the corresponding algorithms are "distributed memory" (DM) and "distributed bag of words" (DBOW). I performed Doc2Vec using all three features to predict whether a document is fake and obtained 56% accuracy.
Sentiment feature:
- CountVectorizer
| Models | Accuracy |
|---|---|
| Naive Bayes | 92.1% |
| Logistic Regression | 94.8% |
| Linear SVM classifier | 95% |
| Stochastic Gradient Descent | 94.1% |
| Random Forest Classifier | 92.0% |
- TF-IDF Vectorizer
| Models | Accuracy |
|---|---|
| Naive Bayes | 95.8% |
| Logistic Regression | 95.8% |
| Linear SVM classifier | 96.7% |
| Stochastic Gradient Descent | 95.8% |
| Random Forest Classifier | 96.1% |
- Doc2Vec
| Models | Accuracy |
|---|---|
| Logistic Regression | 91.1% |
| Linear SVM classifier | 91.1% |
| Stochastic Gradient Descent | 91.1% |
| Random Forest Classifier | 83.9% |
Sensationalism feature:
- CountVectorizer
| Models | Accuracy |
|---|---|
| Naive Bayes | 59.9% |
| Logistic Regression | 63% |
| Linear SVM classifier | 63.45% |
| Stochastic Gradient Descent | 63.4% |
| Random Forest Classifier | 61% |
- TF-IDF Vectorizer
| Models | Accuracy |
|---|---|
| Naive Bayes | 56.4% |
| Logistic Regression | 60 % |
| Linear SVM classifier | 61% |
| Stochastic Gradient Descent | 48 % |
| Random Forest Classifier | 57% |
- Doc2Vec
| Models | Accuracy |
|---|---|
| Logistic Regression | 49.3 % |
| Linear SVM classifier | 49.4% |
| Stochastic Gradient Descent | 48.4 % |
| Random Forest Classifier | 49.8% |
For sensationalism, the CountVectorizer Linear SVM outperformed both TF-IDF and Doc2Vec because CountVectorizer performed simple presence-count vectorization of words, whereas TF-IDF takes a probabilistic weighting approach that gives a more nuanced score for each word. Doc2Vec underperformed because the limited content of each article was not rich enough for the model to generate sensible embeddings.
As a team, we decided to perform the final vectorization using Doc2Vec. With only 56% accuracy on sensationalism, we decided to give this feature a scalar weight of 0.1 in the polynomial equation.
As a team, we decided on the importance of the factors presented above. We brainstormed on the general pre-processing techniques we wanted to use, and we shared common visualization methods and similar techniques for evaluating classification accuracy. Each of us enriched the dataset with an individual feature and persisted the feature vectors (distilled with LDA and sentiment scores) in a CSV file. We then came up with a polynomial equation based on the factors and the accuracy scores we obtained from classification, and used it to build a model for fake news classification. The polynomial equation we used is (0.6 × Clickbait) + (0.1 × Sensationalism) + (0.2 × Political Affiliation) + (0.1 × Context). The final model is a variation of the stacked ensemble technique. Stacked generalization is an ensemble method in which base models are combined using another machine learning algorithm: the base models are trained on the training dataset, their predictions form a new dataset, and that new dataset becomes the input to the combining algorithm. The combined model is then used to predict fakeness in the corpus. As a team, we achieved an accuracy of 57% using the various supervised learning techniques specified in this paper.
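The weighted combination step can be sketched as follows. The per-feature scores here are hypothetical outputs of each member's classifier, and the 0.5 decision threshold is an assumption; only the weights come from the polynomial equation above.

```python
# Hypothetical per-feature fake-news scores in [0, 1] for one article
feature_scores = {"clickbait": 0.8, "sensationalism": 0.4,
                  "political_affiliation": 0.6, "context": 0.5}

# Scalar weights from the polynomial equation (they sum to 1.0)
weights = {"clickbait": 0.6, "sensationalism": 0.1,
           "political_affiliation": 0.2, "context": 0.1}

def fakeness_score(scores, weights):
    """Weighted sum of the per-feature scores."""
    return sum(weights[k] * scores[k] for k in weights)

score = fakeness_score(feature_scores, weights)
# Assumed 0.5 threshold to turn the combined score into a label
label = "fake" if score >= 0.5 else "real"
```

In the actual stacked-ensemble variant, a combining classifier is trained on the base models' predictions rather than applying a fixed threshold.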