Wrangling Tweets from WeRateDogs Account

by Shan Jafri

The wrangling report and analysis report are here and here

Introduction

In a world where data is abundent, gathering and cleaning data is necessary for any data scientist to extract valuable insights. The focus of this project is to gather data from multiple sources, assess dirty data and then clean the data. The final part of the project runs an analysis on the clean dataset.

Gather

In this project, data was gathered from three sources:

The WeRateDogs twitter archive that was provided by Udacity as a csv file.
The image prediction file that was hosted by Udacity serves and downloaded programmatically using request library
Using twitters API and python's tweepy library, any missing information from WeRateDogs twiiter archive such as retweet counts and favorite counts are gathered. The tweet IDs in the WeRateDogs Twitter archive will be use to query the Twitter API for each JSON structure tweet using Python's Tweepy library and in a file called tweet_json.txt file. Each JSON formatted tweet should be written to its own line in a text file and then read line by line into a pandas DataFrame.

Assess

The gathered data has to be assessed for the range of the cleaning issues required. When assessing, the type of cleaning is broken into two categories, tidy and quality data issues. Below are each data issue that need to be address.

Quality

Twitter_archive

tweet_id should be a string instead of an integer
timestamp column should be datetime data type
'Name' columns have incorrect names like 'a', 'such', 'None'
rename columns 'text' = 'tweet_text', 'name' = 'dog_name', 'source' = 'source_application'
in_reply_to_status_id/user_id, retweeted_status_id/user_id should be srings instead of float
ratings_numerator/denominator, retweets, and favorites should be integers instead of floats

Image_prediction

Tweet_id should be a string instead of an integer
Missing values from images dataset only 2075 entires
Some tweet_ids have the same jpg_url
Source columncan be cleaned

json dataframe

remove any duplicate tweet_id's

Tidiness

Merge all data tables together
Dog description in columns instead of rows
Too many columns in image prediciton could condense
Only interested in unique ratings, no retweets
Drop unnecessary columns and NaN rows
retweeted_status_id/user_id should be dropped

Cleaning tidy data issues first is often the best way to start. When structral data issues are solved, cleaning quality data issues become simpler.

Conclusion

Data wrangling skills are fundmental to anyone working with data. Nearly 80% of a data scientist time is spent wrangling data. If data wrangling is properly done, tons of time can be saved during analysis.

The wrangling report and analysis report are here and here

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
README.md		README.md
act_report.ipynb		act_report.ipynb
image-predictions.tsv		image-predictions.tsv
tweet_json.txt		tweet_json.txt
twitter-archive-enhanced.csv		twitter-archive-enhanced.csv
twitter_archive_master.csv		twitter_archive_master.csv
wrangle_report.html		wrangle_report.html
wrangle_report.ipynb		wrangle_report.ipynb
wrangling_act.ipynb		wrangling_act.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Wrangling Tweets from WeRateDogs Account

Introduction

Gather

Assess

Quality

Twitter_archive

Image_prediction

json dataframe

Tidiness

Conclusion

About

Uh oh!

Releases

Packages

Languages

shape335/wrangling_dogs

Folders and files

Latest commit

History

Repository files navigation

Wrangling Tweets from WeRateDogs Account

Introduction

Gather

Assess

Quality

Twitter_archive

Image_prediction

json dataframe

Tidiness

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages