Getting and Cleaning Data complete course is currently being offered by Johns Hopkins University through Coursera platform and is being taught by Brian Caffo, Jeff Leek and Roger D. Peng.

Before you can work with data you have to get some. This course will cover the basic ways that data can be obtained.

The course will cover obtaining data from the web, from APIs, from databases and from colleagues in various formats. It will also cover the basics of data cleaning and how to make data “tidy”. Tidy data dramatically speed downstream data analysis tasks.

The course will also cover the components of a complete data set including raw data, processing instructions, code books, and processed data. The course will cover the basics needed for collecting, cleaning, and sharing data.

Also Check: How to Apply for Coursera Financial Aid Getting and Cleaning Data Week 1 Quiz Answers

And load the data into R. The code book, describing the variable names is here:

How many housing units in this survey were worth more than \$1,000,000?

# fread url requires curl package on mac
# install.packages("curl")

library(data.table)

# VAL attribute says how much property is worth, .N is the number of rows
# VAL == 24 means more than \$1,000,000
housing[VAL == 24, .N]

Question 2) Use the data you loaded from Question 1. Consider the variable FES in the code book. Which of the "tidy data" principles does this variable violate?

Answer: Tidy data one variable per column

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx

Read rows 18-23 and columns 7-15 into R and assign the result to a variable called:

dat

What is the value of:

sum(dat\$Zip*dat\$Ext,na.rm=T)
(original data source: http://catalog.data.gov/dataset/natural-gas-acquisition-program)

fileUrl <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx"

dat <- xlsx::read.xlsx(file = "getdata%2Fdata%2FDATA.gov_NGAP.xlsx", sheetIndex = 1, rowIndex = 18:23, colIndex = 7:15)
sum(dat\$Zip*dat\$Ext,na.rm=T)

Question 4) Read the XML data on Baltimore restaurants from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml

How many restaurants have zipcode 21231?

Use http instead of https, which caused the message Error: XML content does not seem to be XML: 'https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml'.

# install.packages("XML")
library("XML")
fileURL<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
doc <- XML::xmlTreeParse(sub("s", "", fileURL), useInternal = TRUE)
rootNode <- XML::xmlRoot(doc)

zipcodes <- XML::xpathSApply(rootNode, "//zipcode", XML::xmlValue)
xmlZipcodeDT <- data.table::data.table(zipcode = zipcodes)
xmlZipcodeDT[zipcode == "21231", .N]

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv

DT

Which of the following is the fastest way to calculate the average value of the variable

pwgtp15

broken down by sex using the data.table package?

system.time(DT[,mean(pwgtp15),by=SEX])

Getting and Cleaning Data Week 2 Quiz Answers

Question 1) Register an application with the Github API here https://github.com/settings/applications. Access the API to get information on your instructors repositories (hint: this is the url you want "https://api.github.com/users/jtleek/repos").
Use this data to find the time that the datasharing repo was created. What time was it created? This tutorial may be useful (https://github.com/hadley/httr/blob/master/demo/oauth2-github.r). You may also need to run the code in the base R package and not R studio.

#install.packages("jsonlite")
#install.packages("httpuv")
#install.packages("httr")

library(jsonlite)
library(httpuv)
library(httr)

# Can be github, linkedin etc depending on application
oauth_endpoints("github")

# Change based on your appname, key, and secret
key = "8758a6bf9a146e1da0c1",
secret = "b9504edde46b794414495bd9c33ea28cbfd87824")

# Get OAuth credentials
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)

# Use API
gtoken <- config(token = github_token)
req <- GET("https://api.github.com/users/jtleek/repos", gtoken)

# Take action on http error
stop_for_status(req)

# Extract content from a request
json1 = content(req)

# Convert to a data.frame
gitDF = jsonlite::fromJSON(jsonlite::toJSON(json1))

# Subset data.frame
gitDF[gitDF\$full_name == "jtleek/datasharing", "created_at"]

Question 2) The sqldf package allows for execution of SQL commands on R data frames. We will use the sqldf package to practice the queries we might send with the dbSendQuery command in RMySQL.

Which of the following commands will select only the data for the probability weights pwgtp1 with ages less than 50?

# install.packages("sqldf")
library("sqldf")

url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
f <- file.path(getwd(), "ss06pid.csv")

Answer: query1 <- sqldf("select pwgtp1 from acs where AGEP < 50")

Question 3) Using the same data frame you created in the previous problem, what is the equivalent function to unique(acs\$AGEP)

Answer: sqldf("select distinct AGEP from acs")

Question 4) How many characters are in the 10th, 20th, 30th and 100th lines of HTML from this page:

http://biostat.jhsph.edu/~jleek/contact.html

(Hint: the nchar() function in R may be helpful)

connection <- url("http://biostat.jhsph.edu/~jleek/contact.html")
close(connection)
c(nchar(htmlCode), nchar(htmlCode), nchar(htmlCode), nchar(htmlCode))

Question 5) Read this data set into R and report the sum of the numbers in the fourth of the nine columns.

https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for

Original source of the data: http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for
(Hint this is a fixed width file format)

url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for"
lines <- readLines(url, n = 10)
w <- c(1, 9, 5, 4, 1, 3, 5, 4, 1, 3, 5, 4, 1, 3, 5, 4, 1, 3)
colNames <- c("filler", "week", "filler", "sstNino12", "filler", "sstaNino12",
"filler", "sstNino3", "filler", "sstaNino3", "filler", "sstNino34", "filler",
"sstaNino34", "filler", "sstNino4", "filler", "sstaNino4")
d <- read.fwf(url, w, header = FALSE, skip = 4, col.names = colNames)
d <- d[, grep("^[^filler]", names(d))]
sum(d[, 4])

Getting and Cleaning Data Week 3 Quiz Answers

and load the data into R. The code book, describing the variable names is here:

Create a logical vector that identifies the households on greater than 10 acres who sold more than \$10,000 worth of agriculture products. Assign that logical vector to the variable agricultureLogical. Apply the which() function like this to identify the rows of the data frame where the logical vector is TRUE. which(agricultureLogical)

What are the first 3 values that result?

, 'ACS.csv'
, method='curl' )

agricultureLogical <- ACS\$ACR == 3 & ACS\$AGS == 6

Question 2) Using the jpeg package read in the following picture of your instructor into R

Use the parameter native=TRUE. What are the 30th and 80th quantiles of the resulting data?

# install.packages('jpeg')
library(jpeg)

, 'jeff.jpg'
, mode='wb' )

, native=TRUE)

# Get Sample Quantiles corressponding to given prob
quantile(picture, probs = c(0.3, 0.8) )

#       30%       80%
# -15259150 -10575416

Question 3) Load the Gross Domestic Product data for the 190 ranked countries in this data set:

Load the educational data from this data set:

Match the data based on the country shortcode. How many of the IDs match? Sort the data frame in descending order by GDP rank. What is the 13th country in the resulting data frame?

Original data sources: http://data.worldbank.org/data-catalog/GDP-ranking-table http://data.worldbank.org/data-catalog/ed-stats

# install.packages("data.table)
library("data.table")

, skip=4
, nrows = 190
, select = c(1, 2, 4, 5)
, col.names=c("CountryCode", "Rank", "Economy", "Total")
)

)

mergedDT <- merge(FGDP, FEDSTATS_Country, by = 'CountryCode')

# How many of the IDs match?
nrow(mergedDT)

# Sort the data frame in descending order by GDP rank (so United States is last).
# What is the 13th country in the resulting data frame?
mergedDT[order(-Rank)][13,.(Economy)]

#                Economy
# 1: St. Kitts and Nevis

Question 4) What is the average GDP ranking for the "High income: OECD" and "High income: nonOECD" group?

# "High income: OECD"
mergedDT[`Income Group` == "High income: OECD"
, lapply(.SD, mean)
, .SDcols = c("Rank")
, by = "Income Group"]

#
#         Income Group     Rank
# 1: High income: OECD 32.96667

# "High income: nonOECD"
mergedDT[`Income Group` == "High income: nonOECD"
, lapply(.SD, mean)
, .SDcols = c("Rank")
, by = "Income Group"]

#            Income Group     Rank
# 1: High income: nonOECD 91.91304

Question 5) Cut the GDP ranking into 5 separate quantile groups. Make a table versus Income.Group. How many countries are Lower middle income but among the 38 nations with highest GDP?

# install.packages('dplyr')
library('dplyr')

breaks <- quantile(mergedDT[, Rank], probs = seq(0, 1, 0.2), na.rm = TRUE)
mergedDT\$quantileGDP <- cut(mergedDT[, Rank], breaks = breaks)
mergedDT[`Income Group` == "Lower middle income", .N, by = c("Income Group", "quantileGDP")]

Income Group quantileGDP  N
1: Lower middle income (38.6,76.2] 13
2: Lower middle income   (114,152]  9
3: Lower middle income   (152,190] 16
4: Lower middle income  (76.2,114] 11
5: Lower middle income    (1,38.6]  5

Getting and Cleaning Data Week 4 Quiz Answers

and load the data into R. The code book, describing the variable names is here:

Apply strsplit() to split all the names of the data frame on the characters "wgtp". What is the value of the 123 element of the resulting list?

library("data.table")
varNamesSplit <- strsplit(names(communities), "wgtp")
varNamesSplit[]

# ""   "15"

Question 2) Load the Gross Domestic Product data for the 190 ranked countries in this data set:

Remove the commas from the GDP numbers in millions of dollars and average them. What is the average?

Original data sources: http://data.worldbank.org/data-catalog/GDP-ranking-table

# Removed the s from https to be compatible with windows computers.
# Skip first 5 rows and only read in relevent columns
, skip=5
, nrows=190
, select = c(1, 2, 4, 5)
, col.names=c("CountryCode", "Rank", "Country", "GDP")
)

# Remove the commas using gsub
# Convert to integer after removing commas.
# Take mean of GDP column (I know this code may look a little confusing)
GDPrank[, mean(as.integer(gsub(pattern = ',', replacement = '', x = GDP )))]

# 377652.4

Question 3) In the data set from Question 2 what is a regular expression that would allow you to count the number of countries whose name begins with "United"? Assume that the variable with the country names in it is named countryNames. How many countries begin with United?

grep("^United",GDPrank[, Country])

Question 4) Load the Gross Domestic Product data for the 190 ranked countries in this data set:

Load the educational data from this data set:

Match the data based on the country shortcode. Of the countries for which the end of the fiscal year is available, how many end in June?

Original data sources: http://data.worldbank.org/data-catalog/GDP-ranking-table http://data.worldbank.org/data-catalog/ed-stats

, skip=5
, nrows=190
, select = c(1, 2, 4, 5)
, col.names=c("CountryCode", "Rank", "Country", "GDP")
)

mergedDT <- merge(GDPrank, eduDT, by = 'CountryCode')

mergedDT[grepl(pattern = "Fiscal year end: June 30;", mergedDT[, `Special Notes`]), .N]

# 13

Question 5) You can use the quantmod (http://www.quantmod.com/) package to get historical stock prices for publicly traded companies on the NASDAQ and NYSE. Use the following code to download data on Amazon's stock price and get the times the data was sampled.

library(quantmod)
amzn = getSymbols("AMZN",auto.assign=FALSE)
sampleTimes = index(amzn)

How many values were collected in 2012? How many values were collected on Mondays in 2012?

# install.packages("quantmod")
library("quantmod")
amzn <- getSymbols("AMZN",auto.assign=FALSE)
sampleTimes <- index(amzn)
timeDT <- data.table::data.table(timeCol = sampleTimes)

# How many values were collected in 2012?
timeDT[(timeCol >= "2012-01-01") & (timeCol) < "2013-01-01", .N ]