Make sure you install both R and RStudio for this workshop.
Download and install R from https://cran.r-project.org (If you are a Windows user, first determine if you are running the 32 or the 64 bit version)
Download and install RStudio from https://rstudio.com/products/rstudio/download/#download
If you have R and RStudio already installed in your computer, make sure your R version is greater than 4.0
by entering sessionInfo()
in your console.
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_4.0.2 magrittr_1.5 tools_4.0.2 htmltools_0.5.0
## [5] yaml_2.2.1 stringi_1.5.3 rmarkdown_2.4.7 knitr_1.30
## [9] stringr_1.4.0 xfun_0.18 digest_0.6.26 rlang_0.4.8
## [13] evaluate_0.14
For this workshop, we will be using three R packages. Make sure you install these packages before proceeding.
install.packages("tidytext")
install.packages("tidyverse")
install.packages("rvest")
Once the packages are installed, you can load them using library()
.
library(tidytext)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
We will be scraping Amazon reviews to build a corpus. In addition to the text of each review, we will be scraping the rating (number of stars out of 5 stars) of each review.
To scrape an individual page of reviews, first find the product you want to scrape the reviews for. I chose the AmazonBasics Pre-sharpened Wood Cased #2 HB Pencils. Scroll all the way down, past the reviews, and click on See all reviews
. This will direct you to the first page of reviews. Each page contains a total of 10 reviews. Click on Next page
and note how the url
changes. Go pack to the first page, copy the url
and create a url
variable with it.
# set url to desired review website
url <- "https://www.amazon.com/AmazonBasics-Pre-sharpened-Wood-Cased-Pencils/product-reviews/B071JM699P/ref=cm_cr_getr_d_paging_btm_prev_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1"
We can now read the html in from this url
using read_html
.
# download website
amazon_reviews <- read_html(url)
Inspecting the page source of our Amazon review page, we see that the node we want has class="review-text-content"
. Our next step is to parse the html code for these nodes.
# get nodes for review text (all 10)
review_text <- amazon_reviews %>%
html_nodes(".review-text-content") %>%
html_text()
We will also parse the html for the stars, which has class="a-icon-alt"
. These will give us three extra ratings, which are at the top of the page. We need to make sure we keep only the last 10 ratings using tail()
.
# get nodes for rating (last 10)
review_rate <- amazon_reviews %>%
html_nodes(".a-icon-alt") %>%
html_text() %>%
tail(10)
We can now combine the text
and the rate
(i.e., stars) in one data frame.
# build data frame
pencil_reviews <- data.frame(text = review_text,
rate = review_rate)
# inspect data
glimpse(pencil_reviews)
## Rows: 10
## Columns: 2
## $ text <chr> "\n\n\n\n\n\n\n\n\n\n \n \n \n Great price for order decen…
## $ rate <chr> "4.0 out of 5 stars", "1.0 out of 5 stars", "3.0 out of 5 stars"…
Since Amazon review pages has the number of each page in the url (i.e., pageNumber=1
), you can scrape multiple pages changing the page number and doing the same as we did above, with each page.
# do the same as above, but with 200 pages
# start with an empty data frame
pencil_reviews <- data.frame()
# base url has everything by the last number, which indicates page number
base_url <- "https://www.amazon.com/AmazonBasics-Pre-sharpened-Wood-Cased-Pencils/product-reviews/B071JM699P/ref=cm_cr_getr_d_paging_btm_prev_1?ie=UTF8&reviewerType=all_reviews&pageNumber="
# create a for loop for page number
for (i in 1:200){
# add base url to page number
url <- paste0(base_url, i)
# read in html file
amazon_reviews <- read_html(url)
# get nodes
review_text <- amazon_reviews %>%
html_nodes(".review-text-content") %>%
html_text()
review_rate <- amazon_reviews %>%
html_nodes(".a-icon-alt") %>%
html_text() %>%
tail(10)
# bind rows of this page data frame with the rest of the data
pencil_reviews <- bind_rows(pencil_reviews,
data.frame(text = review_text,
rate = review_rate)
)
}
The for loop above takes a couple of minutes to run. So here’s how to read in pre-scraped and saved data.
# read data in (so we don't have to run the for loop above during the workshop)
pencil_reviews <- read_csv("data/pencil_reviews.csv")
## Parsed with column specification:
## cols(
## text = col_character(),
## rate = col_character()
## )
Always inspect your data.
# inspect data
glimpse(pencil_reviews)
## Rows: 2,000
## Columns: 2
## $ text <chr> "\n\n\n\n\n\n\n\n\n\n \n \n \n Great price for order decen…
## $ rate <chr> "4.0 out of 5 stars", "1.0 out of 5 stars", "3.0 out of 5 stars"…
head(pencil_reviews)
## # A tibble: 6 x 2
## text rate
## <chr> <chr>
## 1 "\n\n\n\n\n\n\n\n\n\n \n \n \n Great price for order de… 4.0 out of 5 s…
## 2 "\n\n\n\n\n\n\n\n\n\n \n \n \n As a teacher I need penc… 1.0 out of 5 s…
## 3 "\n\n\n\n\n\n\n\n\n\n \n \n \n I got these pre-sharpene… 3.0 out of 5 s…
## 4 "\n\n\n\n\n\n\n\n\n\n \n \n \n I was so excited to buy … 1.0 out of 5 s…
## 5 "\n\n\n\n\n\n\n\n\n\n \n \n \n These are not great. Th… 3.0 out of 5 s…
## 6 "\n\n\n\n\n\n\n\n\n\n \n \n \n These pencils are hard t… 1.0 out of 5 s…
We need to remove line breaks (\n
).
# replace \n with a space
pencil_reviews %>%
mutate(text = gsub("\\n", " ", text)) %>%
head()
## # A tibble: 6 x 2
## text rate
## <chr> <chr>
## 1 " Great price for order decent pencils … 4.0 out of 5 s…
## 2 " As a teacher I need pencils to last, … 1.0 out of 5 s…
## 3 " I got these pre-sharpened pencils for… 3.0 out of 5 s…
## 4 " I was so excited to buy these for my … 1.0 out of 5 s…
## 5 " These are not great. The lead breaks… 3.0 out of 5 s…
## 6 " These pencils are hard to sharpen, th… 1.0 out of 5 s…
# trim leading white space
pencil_reviews %>%
mutate(text = gsub("\\n", "", text)) %>%
mutate(text = trimws(text)) %>%
head()
## # A tibble: 6 x 2
## text rate
## <chr> <chr>
## 1 Great price for order decent pencils although I'm a pencil sn… 4.0 out of 5 s…
## 2 As a teacher I need pencils to last, the first pencil I sharp… 1.0 out of 5 s…
## 3 I got these pre-sharpened pencils for convenience, because 50… 3.0 out of 5 s…
## 4 I was so excited to buy these for my classroom, but I would n… 1.0 out of 5 s…
## 5 These are not great. The lead breaks constantly. I always c… 3.0 out of 5 s…
## 6 These pencils are hard to sharpen, they break easy, and they … 1.0 out of 5 s…
# looks good, overwrite data
pencil_reviews <- pencil_reviews %>%
mutate(text = gsub("\\n", " ", text)) %>%
mutate(text = trimws(text))
Let’s remove out of 5 stars
from rate
.
# replace "out of 5 stars" with nothing
pencil_reviews %>%
mutate(rate = sub("out of 5 stars", "", rate)) %>%
select(rate) %>%
head()
## # A tibble: 6 x 1
## rate
## <chr>
## 1 "4.0 "
## 2 "1.0 "
## 3 "3.0 "
## 4 "1.0 "
## 5 "3.0 "
## 6 "1.0 "
# looks good, overwrite data
pencil_reviews <- pencil_reviews %>%
mutate(rate = sub("out of 5 stars", "", rate))
# inspect data
glimpse(pencil_reviews)
## Rows: 2,000
## Columns: 2
## $ text <chr> "Great price for order decent pencils although I'm a pencil snob…
## $ rate <chr> "4.0 ", "1.0 ", "3.0 ", "1.0 ", "3.0 ", "1.0 ", "1.0 ", "3.0 ", …
The variable rate
is a chr
type (character or string). We need it to be a numeric data type.
# mutate rate to be a number
pencil_reviews <- pencil_reviews %>%
mutate(rate = parse_number(rate))
# inspect data
glimpse(pencil_reviews)
## Rows: 2,000
## Columns: 2
## $ text <chr> "Great price for order decent pencils although I'm a pencil snob…
## $ rate <dbl> 4, 1, 3, 1, 3, 1, 1, 3, 5, 5, 4, 1, 1, 5, 5, 5, 5, 1, 5, 3, 5, 5…
We can now do things like mean of rate
,
mean(pencil_reviews$rate)
## [1] 4.408
Before we tokenize each review text, we need to create a unique review ID to calculate range later.
# add new review_id column
pencil_reviews <- pencil_reviews %>%
mutate(review_id = row_number())
Now, we can tokenize the text keeping info on what words belong to the same review.
# tokenize words
pencil_reviews_tokenized <- pencil_reviews %>%
unnest_tokens(word, text)
# inspect data
pencil_reviews_tokenized %>%
head()
## # A tibble: 6 x 3
## rate review_id word
## <dbl> <int> <chr>
## 1 4 1 great
## 2 4 1 price
## 3 4 1 for
## 4 4 1 order
## 5 4 1 decent
## 6 4 1 pencils
In addition to individual word tokenization, unnest_tokens()
offers a number of tokenization formatis, including ngrams. Here’s how to get bigrams.
# tokenize bigrams
pencil_reviews_bigrams <- pencil_reviews %>%
unnest_tokens(ngram, text, token = "ngrams", n = 2)
# inspect data
pencil_reviews_bigrams %>%
head()
## # A tibble: 6 x 3
## rate review_id ngram
## <dbl> <int> <chr>
## 1 1 2 as a
## 2 1 2 a teacher
## 3 1 2 teacher i
## 4 1 2 i need
## 5 1 2 need pencils
## 6 1 2 pencils to
We will be working with individual words from this point on.
Stop words are words that are very common in a language, but might not carry a lot of meaning, like function words. Stop words often include pronouns as well, modals, and frequent adverbs.
# check stop_words data frame
stop_words %>%
count(word) %>%
arrange(-n)
## # A tibble: 728 x 2
## word n
## <chr> <int>
## 1 down 4
## 2 would 4
## 3 a 3
## 4 about 3
## 5 above 3
## 6 after 3
## 7 again 3
## 8 against 3
## 9 all 3
## 10 an 3
## # … with 718 more rows
# the smallest lexicon is snowball
stop_words %>%
count(lexicon)
## # A tibble: 3 x 2
## lexicon n
## <chr> <int>
## 1 onix 404
## 2 SMART 571
## 3 snowball 174
Let’s filter the stop words to keep only words from the snowball
lexicon.
my_stop_words <- stop_words %>%
filter(lexicon == "snowball")
We now use this filtered data frame with an anti_join
to keep only words that are not in the stop words list.
# remove stop words from pencil reviews tokenized
pencil_reviews_clean <- pencil_reviews_tokenized %>%
anti_join(my_stop_words)
## Joining, by = "word"
With our data clean, we can start counting words.
# most frequent tokens per rate
pencil_reviews_clean %>%
count(word, rate) %>%
arrange(-n)
## # A tibble: 5,719 x 3
## word rate n
## <chr> <dbl> <int>
## 1 pencils 5 845
## 2 great 5 519
## 3 sharpened 5 339
## 4 good 5 324
## 5 school 5 252
## 6 pencil 5 203
## 7 quality 5 190
## 8 price 5 181
## 9 pencils 1 159
## 10 pre 5 153
## # … with 5,709 more rows
Plotting the data makes it easier to compare frequent tokens across different ratings.
pencil_reviews_clean %>%
count(word, rate) %>%
group_by(rate) %>%
top_n(10) %>%
ggplot(aes(x = n,
y = reorder_within(word, n, rate))) +
geom_col() +
facet_wrap(~rate, scales = "free_y") +
scale_y_reordered() +
labs(y = "")
## Selecting by n
Issues: unbalanced data, we don’t know what words are important to each category.
How many tokens per rate?
# count number of tokens (i.e., rows) per rate
pencil_reviews_clean %>%
count(rate)
## # A tibble: 5 x 2
## rate n
## <dbl> <int>
## 1 1 2870
## 2 2 1163
## 3 3 1594
## 4 4 2124
## 5 5 15186
We can normalize it.
# get total word count per rate, rename n so it's total instead
rate_word_count <- pencil_reviews_clean %>%
count(rate) %>%
rename(total = n)
# get count for each word per rate
word_count_per_rate <- pencil_reviews_clean %>%
count(word, rate)
# normalize individual word count
# first merge two counts
word_count_normalized <- left_join(word_count_per_rate,
rate_word_count)
## Joining, by = "rate"
# create a new normalized count column, with n divided by total
word_count_normalized <- word_count_normalized %>%
mutate(norm_n = (n/total)*1000)
Plotting it again, by normalized frequency instead.
word_count_normalized %>%
group_by(rate) %>%
top_n(10) %>%
ggplot(aes(x = norm_n,
y = reorder_within(word, norm_n, rate))) +
geom_col() +
facet_wrap(~rate, scales = "free_y") +
scale_y_reordered() +
labs(y = "")
## Selecting by norm_n
We can calculate term frequency inverse document frequency (tf-idf
) instead of normalized frequency. The goal in using tf-idf
is to decrease the weight for commonly used words (i.e., words used across all documents) and increase the weight for words that are less frequent in other documents in that collection.
# calculate tf-idf based on n, providing the word column and the category col
word_tf_idf <- word_count_normalized %>%
bind_tf_idf(word, rate, n)
# inspect data
word_tf_idf %>%
head()
## # A tibble: 6 x 8
## word rate n total norm_n tf idf tf_idf
## <chr> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 3 1 1594 0.627 0.000627 1.61 0.00101
## 2 0,1011 5 1 15186 0.0659 0.0000659 1.61 0.000106
## 3 0.08 5 1 15186 0.0659 0.0000659 1.61 0.000106
## 4 1 1 3 2870 1.05 0.00105 0.223 0.000233
## 5 1 2 2 1163 1.72 0.00172 0.223 0.000384
## 6 1 3 2 1594 1.25 0.00125 0.223 0.000280
We can also add range, to decide what words to keep and understand tf-idf
a little better.
# calculate range per word
word_range <- pencil_reviews_clean %>%
distinct(word, review_id) %>%
count(word) %>%
rename(range = n)
# add range to data frame with left_join
word_tf_idf <- left_join(word_tf_idf, word_range)
## Joining, by = "word"
# inspect data
word_tf_idf %>%
head()
## # A tibble: 6 x 9
## word rate n total norm_n tf idf tf_idf range
## <chr> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0 3 1 1594 0.627 0.000627 1.61 0.00101 1
## 2 0,1011 5 1 15186 0.0659 0.0000659 1.61 0.000106 1
## 3 0.08 5 1 15186 0.0659 0.0000659 1.61 0.000106 1
## 4 1 1 3 2870 1.05 0.00105 0.223 0.000233 14
## 5 1 2 2 1163 1.72 0.00172 0.223 0.000384 14
## 6 1 3 2 1594 1.25 0.00125 0.223 0.000280 14
# what's the mean range?
mean(word_tf_idf$range)
## [1] 13.74121
Plotting it again, by tf-idf
filtering by range.
word_tf_idf %>%
filter(range > 20) %>%
group_by(rate) %>%
top_n(n = 10, wt = tf_idf) %>%
ggplot(aes(x = tf_idf,
y = reorder_within(word, tf_idf, rate))) +
geom_col() +
facet_wrap(~rate, scales = "free_y") +
scale_y_reordered() +
labs(y = "")
Sentiment Analysis can be done a number of ways, and more recently, a number of machine learning algorithms have been used to try to extra sentiments (or polarity of opinion) from text. We will use here a much simpler approach, based on hand-coded sentiment for individual lexical units.
The package tidytext
has a sentiments
data set.
# first 10 rows
sentiments %>%
head(10)
## # A tibble: 10 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
# how many total words
sentiments %>%
nrow()
## [1] 6786
# how many for each sentiment
sentiments %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
We can use the function get_sentiments()
to get the sentiments for different lexicons (these lexicons are available under different licenses, so be sure that the license for the lexicon you want to use is appropriate for your project).
# AFINN from Finn Årup Nielsen (http://www2.imm.dtu.dk/pubdb/pubs/6010-full.html)
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
# bing from Bing Liu and collaborators (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html)
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
# nrc from Saif Mohammad and Peter Turney (http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm)
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,891 more rows
We use inner_join
to keep only words that are in our sentiments
data frame, add the corresponding sentiment
column to each word.
# start with our clean tokenized data
# use inner_join to add sentiment from bing lexicon
pencil_reviews_sentiment <- pencil_reviews_clean %>%
inner_join(sentiments)
## Joining, by = "word"
# inspect data
pencil_reviews_sentiment %>%
count(rate, sentiment) %>%
arrange(n)
## # A tibble: 10 x 3
## rate sentiment n
## <dbl> <chr> <int>
## 1 4 negative 91
## 2 2 positive 106
## 3 2 negative 116
## 4 3 negative 134
## 5 3 positive 228
## 6 1 positive 251
## 7 1 negative 328
## 8 4 positive 340
## 9 5 negative 451
## 10 5 positive 2512
Let’s count how many positive and negative words we are left with.
# count tokens in each sentiment
pencil_reviews_sentiment %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 1120
## 2 positive 3437
Since we have a rating for each review, we can calculate the mean
and standard deviation
of these ratings per sentiment. The hypothesis here is that positive words will have a higher mean rating than negative words.
# mean rate of each sentiment
pencil_reviews_sentiment %>%
group_by(sentiment) %>%
summarise(mean_rate = mean(rate),
sd_rate = sd(rate))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## sentiment mean_rate sd_rate
## <chr> <dbl> <dbl>
## 1 negative 3.20 1.71
## 2 positive 4.38 1.20
# regression of sentiment by rate
pencil_reviews_sentiment %>%
mutate(dep_var = ifelse(sentiment == "negative", 0, 1)) %>%
lm(formula = dep_var ~ rate) %>%
summary()
##
## Call:
## lm(formula = dep_var ~ rate, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8512 0.1488 0.1488 0.1488 0.5760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.317235 0.018014 17.61 <2e-16 ***
## rate 0.106787 0.004154 25.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4024 on 4555 degrees of freedom
## Multiple R-squared: 0.1267, Adjusted R-squared: 0.1265
## F-statistic: 660.8 on 1 and 4555 DF, p-value: < 2.2e-16
We can also recalculate our tf-idf
based on our two new categories.
# first we count instances of each word per sentiment
word_sentiment_count <- pencil_reviews_sentiment %>%
count(word, sentiment)
# calculate tf-idf based on n, providing the word column and the category
# column, which is sentiment this time around
sentiment_tf_idf <- word_sentiment_count %>%
bind_tf_idf(word, sentiment, n)
# we can add range here too, which we already calculated
sentiment_tf_idf <- left_join(sentiment_tf_idf, word_range)
## Joining, by = "word"
# inspect data
sentiment_tf_idf %>%
head()
## # A tibble: 6 x 7
## word sentiment n tf idf tf_idf range
## <chr> <chr> <int> <dbl> <dbl> <dbl> <int>
## 1 abundant positive 1 0.000291 0.693 0.000202 1
## 2 abuse negative 3 0.00268 0.693 0.00186 3
## 3 achievement positive 1 0.000291 0.693 0.000202 1
## 4 adequate positive 1 0.000291 0.693 0.000202 1
## 5 advantage positive 1 0.000291 0.693 0.000202 1
## 6 afford positive 2 0.000582 0.693 0.000403 2
Plotting it by tf-idf
filtering by range across different sentiments.
sentiment_tf_idf %>%
group_by(sentiment) %>%
top_n(n = 10, wt = tf_idf) %>%
mutate(tf_idf = ifelse(sentiment == "negative",
-tf_idf,
tf_idf)) %>%
ggplot(aes(x = tf_idf,
y = reorder(word, tf_idf),
fill = sentiment)) +
geom_col() +
geom_label(aes(label = range), show.legend = FALSE) +
labs(y = "") +
ggtitle("Top 10 most important words per sentiment",
subtitle = "labels show range") +
theme_bw()
Please take some time to fill out this workshop’s exit survey.