Module 12 Getting Data
12.1 Search for data sets
There are a number of websites that serve as repositories of data sets. Here’s a list of some resources:
Kaggle Data Sets
Google Dataset Search
U.S. Department of Education Public Data Listing
US Department of Health and Human Services, Datasets & Research Resources
City of Tucson Open Data
12.2 Extracting data tables from websites
Other times the data you want is embedded in a web page, in HTML format. Luckily, there’s an R package, rvest, that extracts tables from HTML files.
As usual, we need to install the package first.
Remember that we need to install a package only once (and update it once in a while), but every time we want to use it, we need to load it with library().
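A minimal sketch of those two steps, assuming the package is rvest (the name that appears in the attach message) and that it is installed from CRAN:

```r
# Install once (uncomment on first use); this downloads the package from CRAN.
# install.packages("rvest")

# Load the package at the start of every new R session.
library(rvest)
```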
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## guess_encoding
Let’s check what tables there are on UArizona’s Wikipedia page.
First, we need to read in the HTML file.
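Reading the page might look like the sketch below. The URL is an assumption (the notes name the page but do not show its address), and the object name uarizona_wiki_html is the one used later in these notes. The live call is left as a comment because it needs an internet connection; the same function is then run on a small literal string (a hypothetical stand-in) so the example works offline.

```r
library(rvest)

# Assumed address of the page; not shown in the original notes.
uarizona_url <- "https://en.wikipedia.org/wiki/University_of_Arizona"

# read_html() accepts a URL, a file path, or a literal HTML string.
# With a URL it downloads the page first:
# uarizona_wiki_html <- read_html(uarizona_url)

# The same call on a literal string, which runs without a connection.
demo_html <- read_html("<html><body><p>Hello</p></body></html>")
class(demo_html)
```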
Now we parse the HTML for tables.
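As a rough illustration of this step, the sketch below selects every table element from a tiny stand-in page built with minimal_html(). The page content is hypothetical; html_elements() is the current rvest name for what older code calls html_nodes().

```r
library(rvest)

# Small stand-in page with two tables (hypothetical content).
demo_html <- minimal_html('
  <table class="infobox"><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <table class="wikitable"><tr><th>b</th></tr><tr><td>2</td></tr></table>
')

# Select every <table> element on the page.
all_tables <- demo_html %>% html_elements("table")
length(all_tables)
```

Applied to the live Wikipedia page, a selector like this gives the node set printed next.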
## {xml_nodeset (19)}
## [1] <table class="infobox vcard">\n<caption class="infobox-title fn org">Uni ...
## [2] <table class="multicol" role="presentation" style="border-collapse: coll ...
## [3] <table class="infobox" style="width: 22em"><tbody>\n<tr><th colspan="2" ...
## [4] <table class="wikitable sortable collapsible collapsed" style="float:rig ...
## [5] <table class="wikitable sortable collapsible collapsed" style="float:rig ...
## [6] <table style="float:right; font-size:85%; margin:10px" class="wikitable" ...
## [7] <table role="presentation" class="mbox-small plainlinks sistersitebox" s ...
## [8] <table class="nowraplinks hlist mw-collapsible mw-collapsed navbox-inner ...
## [9] <table class="nowraplinks mw-collapsible mw-collapsed navbox-inner" styl ...
## [10] <table class="nowraplinks mw-collapsible mw-collapsed navbox-inner" styl ...
## [11] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
## [12] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
## [13] <table class="nowraplinks navbox-subgroup" style="border-spacing:0"><tbo ...
## [14] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
## [15] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
## [16] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
## [17] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
## [18] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" styl ...
## [19] <table class="nowraplinks hlist navbox-inner" style="border-spacing:0;ba ...
That is too many tables. We can be more specific and retrieve only the nodes with a given class.
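A sketch of selecting by class, again on a hypothetical stand-in page. The CSS selector table.wikitable matches any table whose class attribute includes wikitable, even when other classes are listed alongside it. The choice of the wikitable class is an assumption based on the output shown in these notes.

```r
library(rvest)

# Stand-in page: one infobox table and two tables carrying the wikitable class.
demo_html <- minimal_html('
  <table class="infobox"><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <table class="wikitable sortable"><tr><th>b</th></tr><tr><td>2</td></tr></table>
  <table class="wikitable"><tr><th>c</th></tr><tr><td>3</td></tr></table>
')

# Keep only the tables whose class attribute contains "wikitable".
wiki_only <- demo_html %>% html_elements("table.wikitable")
html_attr(wiki_only, "class")
```

On the live page, narrowing to this class leaves the three tables printed next.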
## {xml_nodeset (3)}
## [1] <table class="wikitable sortable collapsible collapsed" style="float:righ ...
## [2] <table class="wikitable sortable collapsible collapsed" style="float:righ ...
## [3] <table style="float:right; font-size:85%; margin:10px" class="wikitable"> ...
This looks a little better.
It looks like the table we want is the third table.
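# create wiki_tables object
# (the selector is assumed from the three-table output above)
wiki_tables <- uarizona_wiki_html %>%
  html_elements("table.wikitable")
# transform node into an actual table
fall_freshman_stats <- wiki_tables[[3]] %>%
  html_table(fill = TRUE)
# check data
fall_freshman_stats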
## # A tibble: 7 x 6
## `` `2017` `2016` `2015` `2014` `2013`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Applicants 36,166 35,236 32,723 26,481 26,329
## 2 Admits 28,433 26,961 24,417 20,546 20,251
## 3 % Admitted 78.6 76.5 74.6 77.5 76.9
## 4 Enrolled 7,360 7,753 7,466 7,744 6,881
## 5 Avg GPA 3.43 3.48 3.38 3.37 3.40
## 6 SAT range* 1015–1250 1010–1230 1010–1230 1000–1230 990–1220
## 7 * SAT out of 1600 <NA> <NA> <NA> <NA> <NA>
Tidy it.
# first column name is blank
colnames(fall_freshman_stats)[1] <- "type"
# pivot years into one column; cell values go to the default "value" column
fall_freshman_stats <- fall_freshman_stats %>%
  pivot_longer(cols = "2017":"2013",
               names_to = "year")
# make value a number (parse_number() already returns a numeric vector)
fall_freshman_stats <- fall_freshman_stats %>%
  mutate(value = parse_number(value))
# inspect data
glimpse(fall_freshman_stats)
## Rows: 35
## Columns: 3
## $ type <chr> "Applicants", "Applicants", "Applicants", "Applicants", "Applica…
## $ year <chr> "2017", "2016", "2015", "2014", "2013", "2017", "2016", "2015", …
## $ value <dbl> 36166.00, 35236.00, 32723.00, 26481.00, 26329.00, 28433.00, 2696…
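The "make value a number" step relies on readr's parse_number(), and it is worth seeing what it does to the different kinds of cells in this table. The sketch below uses strings copied from the table above; the vector name raw_values is just for illustration.

```r
library(readr)

# Sample cells from the table: a count with a comma, a percentage,
# a GPA, an SAT range, and a missing value.
# "\u2013" is the en dash that separates the two ends of the range.
raw_values <- c("36,166", "78.6", "3.43", "1015\u20131250", NA)

# parse_number() drops grouping marks and keeps only the first number it finds.
parsed <- parse_number(raw_values)
parsed
```

Note that the SAT range is reduced to its lower bound (1015), so the upper bound is lost. If the range matters for your analysis, split that row into two values before converting.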
Plot it.
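One possible plot, as a sketch rather than the plot from the original notes: a line chart of the three count rows over time with ggplot2. To keep the example self-contained, the counts are typed in from the table above; the object names counts and p are hypothetical.

```r
library(ggplot2)

# Counts copied from the table above (Applicants, Admits, Enrolled).
counts <- data.frame(
  type  = rep(c("Applicants", "Admits", "Enrolled"), each = 5),
  year  = rep(2017:2013, times = 3),
  value = c(36166, 35236, 32723, 26481, 26329,
            28433, 26961, 24417, 20546, 20251,
            7360, 7753, 7466, 7744, 6881)
)

# One line per type, with year on the x axis.
p <- ggplot(counts, aes(x = year, y = value, color = type)) +
  geom_line() +
  geom_point() +
  labs(x = "Fall term", y = "Number of students", color = NULL)

p
```

With the tidied fall_freshman_stats object, the same plot would start by filtering to these three types and converting year from character to a number.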
12.3 Project Proposal
The project proposal is due April 6, 2021.