Project: Wrangle and Analyze WeRateDogs Twitter Archive

Udacity Data Analyst Nanodegree

Table of Contents

Introduction
What Software Do I Need?
Project Motivation
Project Details

Gathering Data
Assessing Data
Cleaning Data
Storing, Analyzing and Visualizing Data

Report

Introduction

Real-world data rarely comes clean. Using Python and its libraries, I will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, and then clean it. This process is called data wrangling. I will document my wrangling efforts in a Jupyter Notebook and showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset that I will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.

What Software Do I Need?

The following software and packages (libraries) are required for this project:

Software
- Jupyter Notebook
- Microsoft Word
- Google Docs
Packages (Libraries)
- pandas
- NumPy
- matplotlib
- seaborn
- requests
- tweepy
- json

Project Motivation

Context

My goal for this project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

The Data

Enhanced Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which Udacity used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, Udacity filtered for tweets with ratings only (there are 2356).

Udacity extracted this data programmatically, but didn't do a very good job. The ratings probably aren't all correct. Same goes for the dog names and probably dog stages (see below for more information on these) too. I'll need to assess and clean these columns in order to use them for analysis and visualization.

Additional Data via the Twitter API

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But because I have the WeRateDogs Twitter archive, and specifically the tweet IDs within it, I can gather this data for all 5000+ tweets. So I'm going to query Twitter's API to gather this valuable data.

Image Predictions File

One more cool thing: Udacity ran every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

So for the last row in that table:

- tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921
- p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever
- p1_conf is how confident the algorithm is in its #1 prediction → 95%
- p1_dog is whether or not the #1 prediction is a breed of dog → TRUE
- p2 is the algorithm's second most likely prediction → Labrador retriever
- p2_conf is how confident the algorithm is in its #2 prediction → 1%
- p2_dog is whether or not the #2 prediction is a breed of dog → TRUE
- etc.

And the #1 prediction for the image in that tweet was spot on:

So that's all fun and good. But all of this additional data will need to be gathered, assessed, and cleaned. This is where I come in.
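
The prediction columns described above can be read in rank order to find the best guess that is actually a dog breed. The following is only a sketch of that idea: it assumes the table stops at p3 (the top three predictions), that confidences are stored as fractions rather than percentages, and the helper name `first_dog_prediction` and the p3 values are hypothetical.

```python
def first_dog_prediction(row):
    """Return (breed, confidence) for the highest-ranked prediction flagged as a dog."""
    for rank in (1, 2, 3):  # the table keeps only the top three predictions
        if row[f"p{rank}_dog"]:
            return row[f"p{rank}"], row[f"p{rank}_conf"]
    return None  # none of the three predictions is a dog breed


# p1/p2 values follow the row described above; the p3 values are placeholders.
example_row = {
    "tweet_id": 889531135344209921,
    "p1": "golden retriever", "p1_conf": 0.95, "p1_dog": True,
    "p2": "Labrador retriever", "p2_conf": 0.01, "p2_dog": True,
    "p3": "placeholder", "p3_conf": 0.0, "p3_dog": False,
}

print(first_dog_prediction(example_row))
```

Scanning in rank order means a tweet whose #1 prediction is not a dog can still get a breed from its #2 or #3 prediction.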

Project Details

Gathering Data

In this step, I will gather all three pieces of data as described in the "Data Gathering" section in the wrangle_act.ipynb notebook.

Note: the method required to gather each piece of data is different.

The WeRateDogs Twitter archive

Udacity gave this file to me, so I treated it as a file on hand. I downloaded it manually by clicking the following link:

twitter_archive_enhanced.csv

Once it was downloaded, I uploaded it and read the data into a pandas DataFrame.
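
Reading the archive is a single pandas call. This is a minimal sketch rather than the notebook's exact code; the helper name `load_archive` is hypothetical, and the default path assumes the CSV sits next to the notebook under the file name given above.

```python
import pandas as pd


def load_archive(path="twitter_archive_enhanced.csv"):
    """Read the enhanced archive CSV into a DataFrame (one row per tweet)."""
    return pd.read_csv(path)
```

Calling `load_archive().head()` in a notebook cell then gives a first visual check of the columns.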

The tweet image predictions

This file (image_predictions.tsv) records what is present in each tweet's image according to a neural network. It is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
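
A programmatic download with Requests could look like the sketch below. The helper name `download_file` and its optional `session` parameter are assumptions made for illustration (the parameter lets the function be exercised offline with a stand-in object); only `Session.get`, `raise_for_status`, and `content` from Requests are used.

```python
from pathlib import Path

IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_file(url, dest, session=None):
    """Download url and save the response body to dest; return the saved path."""
    if session is None:
        # Imported here so a stand-in session can be passed without Requests installed.
        import requests
        session = requests.Session()
    response = session.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page
    dest = Path(dest)
    dest.write_bytes(response.content)
    return dest
```

After `download_file(IMAGE_PREDICTIONS_URL, "image_predictions.tsv")`, the file can be loaded with `pd.read_csv("image_predictions.tsv", sep="\t")`, since it is tab-separated.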

Additional data from the Twitter API

Gather each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt.

Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.

Note: do not include your Twitter API keys, secrets, and tokens in your project submission.
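
The two steps above (one JSON document per line, then a line-by-line read) can be sketched as follows. This is an illustration under assumptions, not the notebook's exact code: `api` is expected to be an already authenticated `tweepy.API` object created elsewhere, the function names are hypothetical, and the broad `except` stands in for whichever error class the installed Tweepy version raises for deleted or protected tweets.

```python
import json


def save_tweet_json(api, tweet_ids, path="tweet_json.txt"):
    """Write each tweet's full JSON on its own line; return the IDs that failed."""
    failed = []
    with open(path, "w", encoding="utf-8") as handle:
        for tweet_id in tweet_ids:
            try:
                status = api.get_status(tweet_id, tweet_mode="extended")
            except Exception as error:  # assumption: narrow this to Tweepy's error class
                failed.append((tweet_id, str(error)))
                continue
            handle.write(json.dumps(status._json) + "\n")
    return failed


def read_tweet_counts(path="tweet_json.txt"):
    """Read the file line by line and keep only the three required fields."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return rows
```

Passing the result to `pd.DataFrame(read_tweet_counts())` gives the DataFrame with tweet ID, retweet count, and favorite count. Keeping the failed IDs makes it easy to document which tweets could no longer be retrieved.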

Assessing Data

After gathering all three pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least eight (8) quality issues and two (2) tidiness issues in the "Assessing Data" section in the wrangle_act.ipynb Jupyter Notebook.

You need to use two types of assessment:

- Visual assessment: each piece of gathered data is displayed in the Jupyter Notebook for visual assessment purposes. Once displayed, data can additionally be assessed in an external application (e.g. Excel, text editor).
- Programmatic assessment: pandas' functions and/or methods are used to assess the data.
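
A programmatic assessment can be as simple as collecting a few summary numbers per table. The sketch below is one possible approach; the helper name `assess` is hypothetical, and the sample rows and the `name` and `rating_denominator` column names are made up for illustration rather than taken from the real archive.

```python
import pandas as pd


def assess(df, key="tweet_id"):
    """Summarize common quality signals: size, duplicate keys, missing values, dtypes."""
    return {
        "rows": len(df),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Illustrative sample only: one duplicated ID, one missing name, one zero denominator.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2],
    "name": ["Charlie", None, "a"],
    "rating_denominator": [10, 10, 0],
})

summary = assess(sample)
print(summary["duplicate_keys"], summary["missing_per_column"]["name"])
```

In the notebook, calls such as `df.info()`, `df.describe()`, and `df["name"].value_counts()` complement a summary like this.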

To meet specifications, the following issues must be assessed.

- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
- Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
- The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
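
The first requirement above (original ratings with images, no retweets) can be expressed as a filter. This is a sketch under assumptions: it supposes the archive marks retweets with a non-null `retweeted_status_id` column and that a tweet "has an image" when its ID appears in the image predictions table; the function name is hypothetical.

```python
import pandas as pd


def keep_original_ratings_with_images(archive, predictions):
    """Drop retweets, then keep only tweets that appear in the predictions table."""
    originals = archive[archive["retweeted_status_id"].isna()]
    has_image = originals["tweet_id"].isin(predictions["tweet_id"])
    return originals[has_image].reset_index(drop=True)


# Illustrative data: tweet 2 is a retweet, tweet 3 has no image prediction.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [float("nan"), 99.0, float("nan")],
})
predictions = pd.DataFrame({"tweet_id": [1, 2]})

print(keep_original_ratings_with_images(archive, predictions)["tweet_id"].tolist())
```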

If you need some help with the datasets, see the Project Motivation section above.

Cleaning Data

Clean all of the issues you documented while assessing. Perform this cleaning in the "Cleaning Data" section in the wrangle_act.ipynb notebook.

Make sure you complete the following items in this step.

- Before you perform the cleaning, make a copy of the original data.
- During cleaning, use the define-code-test framework and clearly document it.
- Cleaning includes merging individual pieces of data according to the rules of tidy data.
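
The three items above can be sketched for a single issue as follows. The data is illustrative, and the issue chosen (engagement counts stored in a separate table from the tweets they describe) is an assumed example of a tidiness problem, not necessarily one documented in the notebook.

```python
import pandas as pd

# Illustrative stand-ins for two of the gathered tables.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12, 13, 11]})
counts = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [5, 7, 2],
    "favorite_count": [20, 30, 9],
})

# Copy first so the originals stay available for comparison.
archive_clean = archive.copy()
counts_clean = counts.copy()

# Define: counts describe the same observational unit (a tweet) as the archive,
# so merge them into the archive on tweet_id.

# Code
master = archive_clean.merge(counts_clean, on="tweet_id", how="inner")

# Test
assert len(master) == len(archive_clean)
assert {"retweet_count", "favorite_count"} <= set(master.columns)
```

Writing the Define step as a comment (or a Markdown cell) before the code keeps each fix traceable to the issue it addresses.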

The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

Storing, Analyzing and Visualizing Data

Storing Data

In the "Storing Data" section in the wrangle_act.ipynb notebook, store the cleaned master DataFrame (or DataFrames) as CSV, with the main file named twitter_archive_master.csv.
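
Storing the result is one `to_csv` call. The sketch below also reads the file back as a quick sanity check; the helper name `store_master` is hypothetical.

```python
import pandas as pd


def store_master(df, path="twitter_archive_master.csv"):
    """Write the master DataFrame to CSV and return it re-read from disk."""
    df.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    return pd.read_csv(path)
```

Comparing the shape of the returned DataFrame with the original confirms that no rows or columns were lost on the way to disk.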

Analyzing and Visualizing Data

In the Analyzing and Visualizing Data section in my wrangle_act.ipynb Jupyter Notebook, I will analyze and visualize the wrangled data.

Report

I created a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes my wrangling efforts. This is to be framed as an internal document.

I also created a 250-word-minimum written report called act_report.pdf or act_report.html that communicates all the insights and displays the visualization(s) produced from my wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.
