
weiziyuan/Youtube-comments-study


Youtube-comments-study in Apache Spark

For complete code in Spark, please click here

Overview

In this project, we analyzed a dataset of user comments on YouTube videos about pets (cats and dogs). We aim to identify which users do and do not own pets, and which topics interest each group.

Exploratory Data Analysis

  • Remove the missing values
  • Label the data

Specifically, we label users whose comments contain phrases such as 'my dog', 'I have a dog', or 'I have a cat' as pet owners. This rule will miss owners who never use such phrases, so we train a model on the labeled data in the hope of identifying them as well.
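The labeling rule above can be sketched in plain Python. This is only an illustration of the idea, not the project's Spark code: the regular expression and the function name `label_owner` are assumptions, and the source names only 'my dog', 'I have a dog' and 'I have a cat' as example phrases.

```python
import re

# Assumed ownership phrases, modelled on the examples in the text:
# "my dog", "I have a dog", "I have a cat" (plus simple plural forms).
OWNER_PATTERN = re.compile(
    r"\b(?:my|i have an?)\s+(?:dog|cat)s?\b",
    re.IGNORECASE,
)


def label_owner(comment: str) -> int:
    """Return 1 if the comment contains an ownership phrase, else 0."""
    return 1 if OWNER_PATTERN.search(comment) else 0


if __name__ == "__main__":
    for text in ["My dog loves this video", "what a cute dog"]:
        print(text, "->", label_owner(text))
```

A rule like this has high precision but low recall, which is why the labels are used as training data rather than as the final answer.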

  • Convert the comment text to feature vectors using RegexTokenizer and Word2Vec in Spark ML

Word2vec takes a large corpus of text as its input and produces a vector space, typically of several hundred dimensions. Each unique word in the corpus is assigned a corresponding vector in the space. Similar words are close in the vector space, which makes generalization to novel patterns easier and model estimation more robust.
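To make "similar words are close in the vector space" concrete, here is a minimal plain-Python sketch. The three-dimensional vectors in `toy_vectors` are invented for illustration; a trained Word2Vec model would learn much longer vectors from the comments. The averaging in `document_vector` mirrors how Spark ML's Word2Vec model represents a whole document, as far as we understand its documentation.

```python
import math

# Toy 3-dimensional word vectors, made up for this example only.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "video": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def document_vector(tokens, vectors):
    """Average the vectors of the known tokens into one document vector."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


if __name__ == "__main__":
    print("dog vs puppy:", cosine(toy_vectors["dog"], toy_vectors["puppy"]))
    print("dog vs video:", cosine(toy_vectors["dog"], toy_vectors["video"]))
    print("comment vector:", document_vector(["dog", "video"], toy_vectors))
```

The resulting document vector is what a classifier would receive as the feature vector for a comment.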

More about Word2vec: click here

Train Models

  • Model selection: Logistic Regression (LR), Random Forest (RF) and Gradient-Boosted Trees (GBT)
  • Performance

  • Confusion matrix


  • ROC curve

Hyperparameter tuning was implemented with grid search and cross-validation.

Overall, AUC and accuracy rank the models in the same order: RF > LR > GBT.
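For readers unfamiliar with the metrics used above, the following plain-Python sketch shows how a confusion matrix, accuracy and AUC can be computed from labels and scores. The labels and scores in the usage section are made-up toy values, not the project's results, and the function names are our own.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 1 and pred == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(cm):
    """Share of all predictions that are correct."""
    total = sum(cm.values())
    return (cm["tp"] + cm["tn"]) / total


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive example scores higher than a random negative one (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: 1 = pet owner, 0 = non-owner.
    y_true = [1, 0, 1, 0, 0]
    scores = [0.9, 0.4, 0.35, 0.2, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    cm = confusion_matrix(y_true, y_pred)
    print(cm, accuracy(cm), auc(y_true, scores))
```

Accuracy depends on the chosen threshold (0.5 here), while AUC summarizes ranking quality across all thresholds, which is why both are reported.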

Model Application

We choose the random forest model to classify all users, since it showed the best performance.

  • Identify Users

About 11% of all users own pets; in other words, most users who contribute to the comments and video clicks do not have a cat or dog. We therefore believe it is helpful to find the topics these users have in common.

  • Find Interesting Topics

We extract features with CountVectorizer (which converts a collection of text documents to vectors of token counts) and StopWordsRemover (which excludes words that carry little meaningful information). Then we train an LDA model to obtain the important topics of the target users.

More about CountVectorizer and StopWordsRemover: click here
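The two feature-extraction steps can be illustrated in plain Python. This sketch only mimics what StopWordsRemover and CountVectorizer do conceptually; the stop-word list, the tokenizer and all function names are assumptions made for the example, not the Spark ML API.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; Spark ships a much longer default list.
STOP_WORDS = {"the", "a", "is", "so", "and", "this"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(docs):
    """Order words by total frequency (descending), then alphabetically."""
    totals = Counter(token for doc in docs for token in doc)
    return sorted(totals, key=lambda w: (-totals[w], w))


def count_vector(tokens, vocabulary):
    """Represent one document as token counts in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    comments = ["The dog is so cute", "This cute cute cat and the dog"]
    docs = [remove_stop_words(tokenize(c)) for c in comments]
    vocab = build_vocabulary(docs)
    print(vocab)
    print([count_vector(d, vocab) for d in docs])
```

The count vectors produced this way are the kind of input a topic model such as LDA works on.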

-Latent Dirichlet Allocation (LDA)

LDA is a widely used algorithm for topic modelling.

The idea behind LDA: each document is a mixture of topics, and each topic is a mixture of words.

More about LDA: click here
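The "mixture of mixtures" idea can be shown with a few lines of plain Python. The topic names and probabilities below are invented for illustration and are not the topics learned in this project; the sketch only shows how a document's word distribution follows from its topic weights, not how LDA is trained.

```python
# Hypothetical topics: each topic is a probability distribution over words.
topic_word = {
    "love_pets": {"love": 0.5, "dog": 0.3, "cute": 0.2},
    "videos": {"video": 0.6, "love": 0.2, "cute": 0.2},
}

# Hypothetical document: a probability distribution over topics.
doc_topic = {"love_pets": 0.7, "videos": 0.3}


def word_distribution(doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    dist = {}
    for topic, weight in doc_topic.items():
        for word, prob in topic_word[topic].items():
            dist[word] = dist.get(word, 0.0) + weight * prob
    return dist


if __name__ == "__main__":
    for word, prob in sorted(word_distribution(doc_topic, topic_word).items()):
        print(f"{word}: {prob:.2f}")
```

Training LDA runs this reasoning in reverse: given only the observed word counts, it estimates the topic-word and document-topic distributions.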

-Topics that interest non-owners

  1. How much they love the videos.

  2. How cute the pets are.

  3. Interest in coyotes.

-Topics that interest owners

  1. How much they love their pets.

  2. Getting one more pet.

  • Identify the top 10 creators (those with the most pet owners commenting on their channels)

  • Identify the top 10 most active pet owners (those who comment the most)
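Both rankings above reduce to counting over classified comments. A minimal plain-Python sketch follows, assuming each comment is a hypothetical `(creator, user, is_owner)` tuple, where `is_owner` is the classifier's prediction; the sample rows and function names are invented. We also assume that "most pet owners" means distinct owners per creator.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows: (creator, user, predicted_is_owner).
comments = [
    ("creator_a", "u1", True),
    ("creator_a", "u2", True),
    ("creator_a", "u1", True),
    ("creator_b", "u1", True),
    ("creator_b", "u3", False),
]


def top_creators(comments, n=10):
    """Rank creators by the number of distinct pet owners who commented."""
    owners_by_creator = defaultdict(set)
    for creator, user, is_owner in comments:
        if is_owner:
            owners_by_creator[creator].add(user)
    ranked = sorted(owners_by_creator.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(creator, len(users)) for creator, users in ranked[:n]]


def top_active_owners(comments, n=10):
    """Rank pet owners by how many comments they wrote."""
    counts = Counter(user for _, user, is_owner in comments if is_owner)
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_creators(comments))
    print(top_active_owners(comments))
```

On a Spark DataFrame the same result would come from a filter, a group-by with a count, and an ordered limit of 10.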
