Youtube-comments-study in Apache Spark

For complete code in Spark, please click here

Overview

In this project, we analyzed a dataset of user comments on YouTube videos related to pets (cats and dogs). The goal is to identify which users own pets and which do not, and to find the topics that interest each group.

Exploratory Data Analysis

Remove the missing values
Label the data

Specifically, we label users whose comments contain phrases such as 'my dog', 'I have a dog' or 'I have a cat' as pet owners. This rule will miss some owners who never use such phrases. By training a model on the labeled data, we hope to identify those users as well.
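The labeling rule above can be sketched in plain Python. This is only an illustration, not the notebook's Spark code: the regular expression and the function name `label_comment` are assumptions, built from the three example phrases mentioned above.

```python
import re

# Assumed pattern covering only the example phrases from the text:
# 'my dog', 'my cat', 'I have a dog', 'I have a cat' (plurals allowed).
OWNER_PATTERN = re.compile(r"\b(?:my|i have a)\s+(?:dog|cat)s?\b", re.IGNORECASE)


def label_comment(text):
    """Return 1 if the comment looks like it was written by a pet owner, else 0."""
    return 1 if OWNER_PATTERN.search(text) else 0


if __name__ == "__main__":
    for comment in ["My dog loves this video", "I have a cat too", "so cute!"]:
        print(comment, "->", label_comment(comment))
```

A rule like this produces noisy labels (for example, it cannot see owners who write "our puppy"), which is why the labels are used to train a classifier rather than as the final answer.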

Convert the comment text to feature vectors using RegexTokenizer and Word2Vec in Spark ML

Word2vec takes a large corpus of text as its input and produces a vector space, typically of several hundred dimensions. Each unique word in the corpus is assigned a corresponding vector in the space. Similar words are close together in the vector space, which makes generalization to novel patterns easier and model estimation more robust.
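To make the idea concrete, here is a minimal plain-Python sketch of tokenizing a comment and turning it into one feature vector by averaging its word vectors, which is how Spark ML's Word2Vec model represents a document. The two-dimensional vectors in `WORD_VECTORS` are made up for illustration; real vectors are learned from the corpus and are much longer.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


# Toy vectors, invented for this example only.
WORD_VECTORS = {
    "dog": [1.0, 0.0],
    "puppy": [0.9, 0.1],
    "cat": [0.0, 1.0],
}


def document_vector(text, vectors, dim=2):
    """Average the vectors of the known tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokenize(text) if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


if __name__ == "__main__":
    print(document_vector("My dog and my cat", WORD_VECTORS))
    print(cosine(WORD_VECTORS["dog"], WORD_VECTORS["puppy"]))
```

In this toy space "dog" is closer to "puppy" than to "cat", which is the property the classifier benefits from.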

More about Word2vec : click here

Train Models

Model selection: Logistic Regression (LR), Random Forest (RF) and Gradient-Boosted Trees (GBT)
Performance

Confusion matrix

ROC curve

Hyperparameter tuning was implemented with grid search and cross-validation.
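In Spark ML this kind of tuning is usually done with ParamGridBuilder and CrossValidator; the exact grid used in the notebook is not listed here. The plain-Python sketch below only shows the mechanics: every parameter combination is scored on k folds and the best mean score wins. The names `k_fold_indices`, `grid_search` and the `fit_and_score` callback are hypothetical.

```python
from itertools import product


def k_fold_indices(n, k):
    """Yield (train_idx, valid_idx) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, valid
        start += size


def grid_search(param_grid, n_rows, k, fit_and_score):
    """Return (best_params, best_mean_score) over all parameter combinations.

    fit_and_score(params, train_idx, valid_idx) is supplied by the caller and
    must return a score where higher is better (for example AUC).
    """
    names = sorted(param_grid)
    best = None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [fit_and_score(params, tr, va) for tr, va in k_fold_indices(n_rows, k)]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[1]:
            best = (params, mean)
    return best
```

The cost grows with the number of combinations times the number of folds, so small grids are a sensible starting point.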

Overall, the ranking is the same for both AUC and accuracy: RF > LR > GBT.
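For reference, the metrics named in this section can be computed as in the following plain-Python sketch. It is not the notebook's evaluation code, and the function names are chosen here for illustration. AUC is computed as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def accuracy(cm):
    """Share of all predictions that are correct."""
    total = cm["tp"] + cm["fp"] + cm["fn"] + cm["tn"]
    return (cm["tp"] + cm["tn"]) / total


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because only about one user in nine is labeled as an owner, accuracy alone can look good for a model that rarely predicts "owner", so AUC and the confusion matrix are worth reading together.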

Model Application

We chose the random forest model to classify all users because it had the best performance.

Identify Users

About 11% of all users own pets. In other words, most users who contribute comments and video clicks don't have a cat or dog. We therefore believe it would be helpful to find the common topics among these users.

Find Interesting Topics

Extract features with CountVectorizer (which converts a collection of text documents to vectors of token counts) and StopWordsRemover (which excludes words that don't carry much meaningful information). Then train a model with the LDA algorithm to obtain the important topics for the target users.
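What these two feature steps do can be shown with a small plain-Python sketch. It mirrors the behavior described above but is not the Spark implementation; the stop-word list is a tiny made-up sample and the function names are assumptions.

```python
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "so", "and", "i", "my"}


def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def build_vocabulary(docs):
    """Order words by total frequency (descending), then alphabetically."""
    totals = Counter(t for doc in docs for t in doc)
    return sorted(totals, key=lambda w: (-totals[w], w))


def count_vector(tokens, vocab):
    """Return one count per vocabulary word for a single document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


if __name__ == "__main__":
    raw = [["the", "dog", "is", "cute"], ["my", "cat", "and", "dog"]]
    docs = [remove_stop_words(d) for d in raw]
    vocab = build_vocabulary(docs)
    print(vocab, [count_vector(d, vocab) for d in docs])
```

The resulting count vectors are the input that a topic model such as LDA expects.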

More about CountVectorizer and StopWordsRemover : click here

-Latent Dirichlet Allocation (LDA)

LDA is an important algorithm used for topic modelling.

Ideas behind LDA: each document is a mixture of topics and each topic is a mixture of words.
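The "mixture" idea can be illustrated with a few lines of plain Python. Given a document's topic weights and each topic's word probabilities, the probability of a word in that document is the weighted sum over topics. All numbers and topic names below are invented for illustration; LDA's job is to estimate such tables from the count vectors.

```python
# Made-up topic-word probabilities (each topic sums to 1).
TOPIC_WORDS = {
    "cute_pets": {"cute": 0.6, "dog": 0.3, "food": 0.1},
    "pet_care": {"cute": 0.1, "dog": 0.3, "food": 0.6},
}


def word_distribution(doc_topics, topic_words):
    """Combine topic-word probabilities using the document's topic weights."""
    words = {}
    for topic, weight in doc_topics.items():
        for word, prob in topic_words[topic].items():
            words[word] = words.get(word, 0.0) + weight * prob
    return words


if __name__ == "__main__":
    # A document that is half about each topic.
    print(word_distribution({"cute_pets": 0.5, "pet_care": 0.5}, TOPIC_WORDS))
```

Reading a fitted model works the other way round: the highest-probability words of each topic are inspected and given a human-readable description, as in the topic lists in this section.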

More about LDA : click here

-Topics that are interesting to non-owners

how they love the videos.
how cute these pets are
interest in coyote.

-Topics that are interesting to owners

how they love their pets.
get one more pet.

Identify the top 10 creators (those whose channels have the most pet owners commenting)

Identify the top 10 most active pet owners (those who comment the most)
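Both rankings are simple counts once every user has a predicted label. The plain-Python sketch below assumes each comment is a `(creator, user, is_owner)` tuple, which is a hypothetical layout rather than the dataset's actual schema, and it assumes that creators are ranked by the number of distinct owners commenting.

```python
from collections import Counter


def top_creators(comments, n=10):
    """Rank creators by the number of distinct pet owners who commented."""
    owners_per_creator = {}
    for creator, user, is_owner in comments:
        if is_owner:
            owners_per_creator.setdefault(creator, set()).add(user)
    counts = Counter({c: len(users) for c, users in owners_per_creator.items()})
    return counts.most_common(n)


def top_active_owners(comments, n=10):
    """Rank pet owners by how many comments they wrote."""
    counts = Counter(user for _, user, is_owner in comments if is_owner)
    return counts.most_common(n)
```

Counting distinct owners per creator avoids letting one very talkative owner dominate a channel's rank; counting raw comments instead would be an equally reasonable reading of the task.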

weiziyuan/Youtube-comments-study
