This material was prepared for a workshop on corpus linguistics and Twitter mining for the NAU Corpus Club and COLISTO.
Make sure you install both R and RStudio for this workshop.
Download and install R from https://cran.r-project.org (if you are a Windows user, first determine whether you are running the 32-bit or the 64-bit version of Windows).
Download and install RStudio from https://rstudio.com/products/rstudio/download/#download
If you already have R and RStudio installed on your computer, make sure your R version is 4.0 or higher by entering sessionInfo() in your console.
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_4.0.2 magrittr_1.5 tools_4.0.2 htmltools_0.5.0
## [5] yaml_2.2.1 stringi_1.5.3 rmarkdown_2.4.7 knitr_1.30
## [9] stringr_1.4.0 xfun_0.18 digest_0.6.26 rlang_0.4.8
## [13] evaluate_0.14
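If you prefer a quicker programmatic check, base R's getRversion() returns the version number directly; this one-liner is just a convenience, not a required workshop step.
# should return TRUE if your R version is at least 4.0.0
getRversion() >= "4.0.0"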
For this workshop, we will be using three R packages. Make sure you install these packages before proceeding.
install.packages("rtweet")
install.packages("tidytext")
install.packages("tidyverse")
Once the packages are installed, you can load them using library().
library(rtweet)
library(tidytext)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x purrr::flatten() masks rtweet::flatten()
## x dplyr::lag() masks stats::lag()
In this workshop, we will download public users’ timelines. I chose two famous people in politics, but you are welcome to choose other Twitter users. I also set n to 3,200 tweets, which is the maximum number of tweets you can download without a developer key. Once you run the get_timeline() function below, your browser should pop up an authentication request, so make sure you are logged in to your Twitter account.
# get timelines
tweets <- get_timeline(c("AOC", "SpeakerPelosi"), n = 3200)
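If the download works, you may want to save a local copy so you don't have to hit the Twitter API again in a later session. This is optional, and the file name below is just an example.
# optional: save the downloaded tweets so they can be reloaded later with readRDS()
saveRDS(tweets, "my_tweets.rds")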
If you were timed out or were unable to download tweets, you can read in a file I prepared for this workshop so you can follow along with the remaining steps.
# alternative: read in the prepared tweets
tweets <- readRDS("tweets.rds")
Check how many tweets we retrieved per user.
# count tweets by user
tweets %>%
  count(screen_name)
## # A tibble: 2 x 2
## screen_name n
## <chr> <int>
## 1 AOC 3198
## 2 SpeakerPelosi 3200
Check the date range for the tweets.
# get the earliest and latest tweet creation date for each user
tweets %>%
  group_by(screen_name) %>%
  summarise(begin = min(created_at),
            end = max(created_at))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## screen_name begin end
## <chr> <dttm> <dttm>
## 1 AOC 2019-06-30 13:43:17 2020-11-01 17:33:11
## 2 SpeakerPelosi 2018-07-25 17:10:15 2020-10-31 13:49:11
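Because n = 3200 caps the number of tweets rather than a time window, the two date ranges differ quite a bit. If you are curious, here is a small optional sketch that computes how many days each timeline spans.
# how many days does each user's timeline span?
tweets %>%
  group_by(screen_name) %>%
  summarise(days_covered = as.numeric(difftime(max(created_at),
                                               min(created_at),
                                               units = "days")))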
We are not interested in retweets, just original tweets.
# filter out (!) anything that is a retweet
original_tweets <- tweets %>%
  filter(!is_retweet)
# count tweets by user for original tweets
original_tweets %>%
  count(screen_name)
## # A tibble: 2 x 2
## screen_name n
## <chr> <int>
## 1 AOC 1927
## 2 SpeakerPelosi 2608
Let’s reduce the number of variables in our data so it’s more manageable. We are keeping only the first 7 columns, plus favorite count and retweet count.
tweets_smaller <- original_tweets %>%
  select(user_id:display_text_width,
         favorite_count, retweet_count)
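To confirm which columns were kept, glimpse() (loaded with the tidyverse) prints every column name with its type and a preview of its values.
# check which columns survived the select()
tweets_smaller %>%
  glimpse()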
Now we tokenize the text.
# tokenize words
tweets_tokenized <- tweets_smaller %>%
  unnest_tokens(word, text)
# inspect data
tweets_tokenized %>%
  head()
## # A tibble: 6 x 9
## user_id status_id created_at screen_name source display_text_wi…
## <chr> <chr> <dttm> <chr> <chr> <dbl>
## 1 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## 2 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## 3 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## 4 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## 5 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## 6 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## # … with 3 more variables: favorite_count <int>, retweet_count <int>,
## # word <chr>
In addition to individual word tokenization, unnest_tokens() offers a number of tokenization formats, including ngrams. Here’s how to get bigrams.
# tokenize bigrams
tweets_bigrams <- tweets_smaller %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2)
# inspect data
tweets_bigrams %>%
  head()
## # A tibble: 6 x 9
## user_id status_id created_at screen_name source display_text_wi…
## <chr> <chr> <dttm> <chr> <chr> <dbl>
## 1 138203… 11453269… 2019-06-30 13:43:17 AOC Twitt… 278
## 2 138203… 11453269… 2019-06-30 13:43:17 AOC Twitt… 278
## 3 138203… 11453269… 2019-06-30 13:43:17 AOC Twitt… 278
## 4 138203… 11453269… 2019-06-30 13:43:17 AOC Twitt… 278
## 5 138203… 11453269… 2019-06-30 13:43:17 AOC Twitt… 278
## 6 138203… 11453269… 2019-06-30 13:43:17 AOC Twitt… 278
## # … with 3 more variables: favorite_count <int>, retweet_count <int>,
## # ngram <chr>
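Just as with single words, you can count the most frequent bigrams per user. This is a quick optional sketch using the same count() pattern we apply to words below.
# count the most frequent bigrams per user
tweets_bigrams %>%
  count(ngram, screen_name, sort = TRUE) %>%
  head()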
Now that we have one row for each word in our tweets_tokenized data frame, we can calculate the size of our sub-corpora using the count() function.
subcorpora_size <- tweets_tokenized %>%
  count(screen_name)
# inspect data
subcorpora_size
## # A tibble: 2 x 2
## screen_name n
## <chr> <int>
## 1 AOC 66948
## 2 SpeakerPelosi 89003
Stop words are words that are very common in a language but might not carry a lot of meaning, like function words. Stop word lists often include pronouns, modals, and frequent adverbs as well.
# check stop_words data frame
stop_words %>%
  count(word) %>%
  arrange(-n)
## # A tibble: 728 x 2
## word n
## <chr> <int>
## 1 down 4
## 2 would 4
## 3 a 3
## 4 about 3
## 5 above 3
## 6 after 3
## 7 again 3
## 8 against 3
## 9 all 3
## 10 an 3
## # … with 718 more rows
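The counts above are greater than 1 because stop_words combines several lexicons, so the same word can be listed more than once. You can inspect a single word to see where it comes from (the word chosen here is just an example).
# see which lexicons list a given word
stop_words %>%
  filter(word == "would")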
# the smallest lexicon is snowball
stop_words %>%
  count(lexicon)
## # A tibble: 3 x 2
## lexicon n
## <chr> <int>
## 1 onix 404
## 2 SMART 571
## 3 snowball 174
Let’s filter the stop words to keep only words from the snowball lexicon.
my_stop_words <- stop_words %>%
  filter(lexicon == "snowball")
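As a quick sanity check, the filtered data frame should have 174 rows, matching the snowball count above.
# sanity check: should match the size of the snowball lexicon (174)
nrow(my_stop_words)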
We now use this filtered data frame with an anti_join() to keep only words that are not in the stop words list.
# remove stop words from the tokenized tweets
tweets_tokenized_clean <- tweets_tokenized %>%
  anti_join(my_stop_words)
## Joining, by = "word"
# inspect data
tweets_tokenized_clean %>%
  head()
## # A tibble: 6 x 9
## user_id status_id created_at screen_name source display_text_wi…
## <chr> <chr> <dttm> <chr> <chr> <dbl>
## 1 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## 2 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## 3 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## 4 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## 5 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## 6 138203… 13229453… 2020-11-01 16:55:26 AOC Twitt… 49
## # … with 3 more variables: favorite_count <int>, retweet_count <int>,
## # word <chr>
We also use count() to count the frequency of individual words per screen_name.
# arrange count so we see most frequent words first
tweets_tokenized_clean %>%
  count(word, screen_name) %>%
  arrange(-n) %>%
  head()
## # A tibble: 6 x 3
## word screen_name n
## <chr> <chr> <int>
## 1 https SpeakerPelosi 2234
## 2 t.co SpeakerPelosi 2234
## 3 t.co AOC 1261
## 4 https AOC 1255
## 5 amp SpeakerPelosi 1190
## 6 amp AOC 846
Hmmm… the most frequent tokens are URL fragments and other symbols. Let’s fix that.
We first need to create a list of tokens to remove.
tokens_to_remove <- c("https", "t.co", "amp", "et")
Now we can filter out (!) any tokens that are in our list.
# remove unwanted tokens from the tokenized tweets
tweets_tokenized_clean <- tweets_tokenized_clean %>%
  filter(!(word %in% tokens_to_remove))
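A quick optional check that the unwanted tokens are really gone; this should return 0.
# verify the removal worked
tweets_tokenized_clean %>%
  filter(word %in% tokens_to_remove) %>%
  nrow()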
Count the words again to see if it looks better.
# arrange count so we see most frequent words first
tweets_tokenized_clean %>%
  count(word, screen_name) %>%
  arrange(-n) %>%
  head()
## # A tibble: 6 x 3
## word screen_name n
## <chr> <chr> <int>
## 1 will SpeakerPelosi 435
## 2 house SpeakerPelosi 426
## 3 americans SpeakerPelosi 382
## 4 people AOC 379
## 5 it’s AOC 341
## 6 american SpeakerPelosi 316
# looks good, create word_frequency_per_user data frame
word_frequency_per_user <- tweets_tokenized_clean %>%
  count(word, screen_name)
Plotting the data makes it easier to compare frequent tokens across different users.
word_frequency_per_user %>%
  group_by(screen_name) %>%
  top_n(20) %>%
  ggplot(aes(x = n,
             y = reorder_within(word, n, screen_name))) +
  geom_col() +
  facet_wrap(~screen_name, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")
## Selecting by n
We calculated the size of each sub-corpus earlier, so we can now normalize the frequencies.
# inspect subcorpora_size again
subcorpora_size
## # A tibble: 2 x 2
## screen_name n
## <chr> <int>
## 1 AOC 66948
## 2 SpeakerPelosi 89003
# rename the column n to total
subcorpora_size <- subcorpora_size %>%
  rename(total = n)
# inspect subcorpora_size again
subcorpora_size
## # A tibble: 2 x 2
## screen_name total
## <chr> <int>
## 1 AOC 66948
## 2 SpeakerPelosi 89003
We can now join subcorpora_size with our word_frequency_per_user data frame by the column they have in common, which is screen_name.
word_frequency_per_user <- left_join(word_frequency_per_user,
                                     subcorpora_size)
## Joining, by = "screen_name"
# inspect data
word_frequency_per_user %>%
  head()
## # A tibble: 6 x 4
## word screen_name n total
## <chr> <chr> <int> <int>
## 1 _pamcampos AOC 1 66948
## 2 _vulvarine AOC 1 66948
## 3 ー AOC 2 66948
## 4 ー SpeakerPelosi 1 89003
## 5 0 AOC 12 66948
## 6 00 SpeakerPelosi 7 89003
We create a new column with normalized frequency using mutate(): the raw count divided by the sub-corpus total, multiplied by 10,000 (i.e., frequency per 10,000 words).
word_frequency_per_user <- word_frequency_per_user %>%
  mutate(n_norm = (n/total)*10000)
# inspect data
word_frequency_per_user %>%
  head()
## # A tibble: 6 x 5
## word screen_name n total n_norm
## <chr> <chr> <int> <int> <dbl>
## 1 _pamcampos AOC 1 66948 0.149
## 2 _vulvarine AOC 1 66948 0.149
## 3 ー AOC 2 66948 0.299
## 4 ー SpeakerPelosi 1 89003 0.112
## 5 0 AOC 12 66948 1.79
## 6 00 SpeakerPelosi 7 89003 0.786
Plot the data again, this time by normalized frequency.
word_frequency_per_user %>%
  group_by(screen_name) %>%
  top_n(20) %>%
  ggplot(aes(x = n_norm,
             y = reorder_within(word, n_norm, screen_name))) +
  geom_col() +
  facet_wrap(~screen_name, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")
## Selecting by n_norm
We can calculate term frequency-inverse document frequency (tf-idf) instead of normalized frequency. The goal of using tf-idf is to decrease the weight of commonly used words (i.e., words used across all documents) and increase the weight of words that appear in few of the documents in the collection.
# calculate tf-idf based on n, providing the word column and the category column (screen_name)
word_tf_idf <- word_frequency_per_user %>%
  bind_tf_idf(word, screen_name, n)
# inspect data
word_tf_idf %>%
  head()
## # A tibble: 6 x 8
## word screen_name n total n_norm tf idf tf_idf
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 _pamcampos AOC 1 66948 0.149 0.0000261 0.693 0.0000181
## 2 _vulvarine AOC 1 66948 0.149 0.0000261 0.693 0.0000181
## 3 ー AOC 2 66948 0.299 0.0000521 0 0
## 4 ー SpeakerPelosi 1 89003 0.112 0.0000198 0 0
## 5 0 AOC 12 66948 1.79 0.000313 0.693 0.000217
## 6 00 SpeakerPelosi 7 89003 0.786 0.000139 0.693 0.0000961
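If you want a better feel for what bind_tf_idf() is doing, here is a minimal manual sketch (not the tidytext source code). It assumes tf is a word’s count divided by the total count supplied for that user, and idf is the natural log of the number of users divided by the number of users who tweet the word.
# a manual tf-idf sketch, assuming 2 users (documents) and a natural-log idf
manual_tf_idf <- word_frequency_per_user %>%
  group_by(screen_name) %>%
  mutate(tf = n / sum(n)) %>%                        # share of each user's counted words
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(2 / n_distinct(screen_name))) %>% # 0 if both users tweet the word, log(2) otherwise
  ungroup() %>%
  mutate(tf_idf = tf * idf)
If those assumptions hold, the tf, idf, and tf_idf columns should match the ones in word_tf_idf (up to rounding).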
We can also add range (the number of individual tweets a word appears in) to help decide which words to keep and to understand tf-idf a little better.
# calculate range per word (status_id indicates individual tweets)
word_range <- tweets_tokenized_clean %>%
  distinct(word, status_id) %>%
  count(word) %>%
  rename(range = n)
# add range to data frame with left_join
word_tf_idf <- left_join(word_tf_idf, word_range)
## Joining, by = "word"
# inspect data
word_tf_idf %>%
  head()
## # A tibble: 6 x 9
## word screen_name n total n_norm tf idf tf_idf range
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 _pamcampos AOC 1 66948 0.149 0.0000261 0.693 0.0000181 1
## 2 _vulvarine AOC 1 66948 0.149 0.0000261 0.693 0.0000181 1
## 3 ー AOC 2 66948 0.299 0.0000521 0 0 3
## 4 ー SpeakerPelosi 1 89003 0.112 0.0000198 0 0 3
## 5 0 AOC 12 66948 1.79 0.000313 0.693 0.000217 10
## 6 00 SpeakerPelosi 7 89003 0.786 0.000139 0.693 0.0000961 7
# what's the mean range?
mean(word_tf_idf$range)
## [1] 8.025293
Let’s plot it again, this time by tf-idf, filtering by range.
word_tf_idf %>%
  filter(range > 5) %>%
  group_by(screen_name) %>%
  top_n(n = 10, wt = tf_idf) %>%
  ggplot(aes(x = tf_idf,
             y = reorder_within(word, tf_idf, screen_name))) +
  geom_col() +
  facet_wrap(~screen_name, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")
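If you want to keep this last figure, ggsave() writes the most recent plot to disk. The file name and dimensions below are just examples.
# optional: save the last plot to a file
ggsave("tf_idf_by_user.png", width = 8, height = 5)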