This material was prepared for a workshop on corpus linguistics and Twitter mining for the NAU Corpus Club and COLISTO.

1 Install R and RStudio

Make sure you install both R and RStudio for this workshop.

  1. Download and install R from https://cran.r-project.org (if you are a Windows user, first determine whether you are running the 32-bit or the 64-bit version of Windows)

  2. Download and install RStudio from https://rstudio.com/products/rstudio/download/#download

If you already have R and RStudio installed on your computer, make sure your R version is 4.0 or higher by entering sessionInfo() in your console.

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_4.0.2  magrittr_1.5    tools_4.0.2     htmltools_0.5.0
##  [5] yaml_2.2.1      stringi_1.5.3   rmarkdown_2.4.7 knitr_1.30     
##  [9] stringr_1.4.0   xfun_0.18       digest_0.6.26   rlang_0.4.8    
## [13] evaluate_0.14
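
If you just want the version number, a quicker check is getRversion(), which is part of base R; a minimal sketch:

# quick check of the R version only
getRversion()
# TRUE if your R is recent enough
getRversion() >= "4.0"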

2 Install and Load Libraries

For this workshop, we will be using three R packages. Make sure you install these packages before proceeding.

install.packages("rtweet")
install.packages("tidytext")
install.packages("tidyverse")

Once the packages are installed, you can load them using library().

library(rtweet)
library(tidytext)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()  masks stats::filter()
## x purrr::flatten() masks rtweet::flatten()
## x dplyr::lag()     masks stats::lag()

3 Download Tweets

In this workshop, we will download public users’ timelines. I chose two famous people in politics, but you are welcome to choose other Twitter users. I also set n to 3,200 tweets, which is the maximum number of tweets you can download without a developer key. Once you run the get_timeline() function below, your browser should pop up an authentication request, so make sure you are logged in to your Twitter account.

# get timelines
tweets <- get_timeline(c("AOC", "SpeakerPelosi"), n = 3200)
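
If the download worked, it is worth saving a local copy so you do not have to query the API again; a minimal sketch (the file name simply matches the file read in the next step):

# save the downloaded tweets so they can be re-loaded later with readRDS()
saveRDS(tweets, "tweets.rds")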

If you were timed out or unable to download tweets, you can read in a file I prepared for this workshop so you can follow along with the remaining steps.

# alternate get timelines
tweets <- readRDS("tweets.rds")

4 Inspect and Clean Tweets

Check how many tweets we retrieved per user.

# count tweets by user
tweets %>%
  count(screen_name)
## # A tibble: 2 x 2
##   screen_name       n
##   <chr>         <int>
## 1 AOC            3198
## 2 SpeakerPelosi  3200

Check the date range of the tweets for each user.

# get min and max of dates tweets were created by users
tweets %>%
  group_by(screen_name) %>%
  summarise(begin = min(created_at),
            end = max(created_at))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   screen_name   begin               end                
##   <chr>         <dttm>              <dttm>             
## 1 AOC           2019-06-30 13:43:17 2020-11-01 17:33:11
## 2 SpeakerPelosi 2018-07-25 17:10:15 2020-10-31 13:49:11

We are not interested in retweets, just original tweets, so let’s filter the retweets out.

# filter out (!) anything that is a retweet
original_tweets <- tweets %>%
  filter(!is_retweet)

# count tweets by user for original tweets
original_tweets %>%
  count(screen_name)
## # A tibble: 2 x 2
##   screen_name       n
##   <chr>         <int>
## 1 AOC            1927
## 2 SpeakerPelosi  2608

5 Tokenize the Text

Let’s reduce the number of variables in our data so it’s more manageable. We are keeping only the first seven columns (user_id through display_text_width), plus favorite_count and retweet_count.

tweets_smaller <- original_tweets %>%
  select(user_id:display_text_width,
         favorite_count, retweet_count)

Now we tokenize the text.

# tokenize words
tweets_tokenized <- tweets_smaller %>%
  unnest_tokens(word, text)

# inspect data
tweets_tokenized %>%
  head()
## # A tibble: 6 x 9
##   user_id status_id created_at          screen_name source display_text_wi…
##   <chr>   <chr>     <dttm>              <chr>       <chr>             <dbl>
## 1 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## 2 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## 3 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## 4 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## 5 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## 6 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## # … with 3 more variables: favorite_count <int>, retweet_count <int>,
## #   word <chr>

In addition to individual word tokenization, unnest_tokens() offers a number of tokenization formats, including n-grams. Here’s how to get bigrams.

# tokenize bigrams
tweets_bigrams <- tweets_smaller %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2)

# inspect data
tweets_bigrams %>%
  head()
## # A tibble: 6 x 9
##   user_id status_id created_at          screen_name source display_text_wi…
##   <chr>   <chr>     <dttm>              <chr>       <chr>             <dbl>
## 1 138203… 11453269… 2019-06-30 13:43:17 AOC         Twitt…              278
## 2 138203… 11453269… 2019-06-30 13:43:17 AOC         Twitt…              278
## 3 138203… 11453269… 2019-06-30 13:43:17 AOC         Twitt…              278
## 4 138203… 11453269… 2019-06-30 13:43:17 AOC         Twitt…              278
## 5 138203… 11453269… 2019-06-30 13:43:17 AOC         Twitt…              278
## 6 138203… 11453269… 2019-06-30 13:43:17 AOC         Twitt…              278
## # … with 3 more variables: favorite_count <int>, retweet_count <int>,
## #   ngram <chr>
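
Although we will not use the bigrams further in this workshop, you could count and rank them per user the same way we count individual words below; a minimal sketch:

# count bigrams per user and show the most frequent ones
tweets_bigrams %>%
  count(ngram, screen_name) %>%
  arrange(-n) %>%
  head()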

6 Size of Sub-corpora

Now that we have one row for each word in our tweets_tokenized data frame, we can calculate the size of our sub-corpora using the count() function.

subcorpora_size <- tweets_tokenized %>%
  count(screen_name)

# inspect data
subcorpora_size
## # A tibble: 2 x 2
##   screen_name       n
##   <chr>         <int>
## 1 AOC           66948
## 2 SpeakerPelosi 89003

7 Remove Stop Words

Stop words are words that are very common in a language but often do not carry much meaning, such as function words. Stop word lists typically also include pronouns, modals, and frequent adverbs.

# check stop_words data frame
stop_words %>%
  count(word) %>%
  arrange(-n)
## # A tibble: 728 x 2
##    word        n
##    <chr>   <int>
##  1 down        4
##  2 would       4
##  3 a           3
##  4 about       3
##  5 above       3
##  6 after       3
##  7 again       3
##  8 against     3
##  9 all         3
## 10 an          3
## # … with 718 more rows
# the smallest lexicon is snowball
stop_words %>%
  count(lexicon)
## # A tibble: 3 x 2
##   lexicon      n
##   <chr>    <int>
## 1 onix       404
## 2 SMART      571
## 3 snowball   174

Let’s filter the stop words to keep only words from the snowball lexicon.

my_stop_words <- stop_words %>%
  filter(lexicon == "snowball")

We now use this filtered data frame with anti_join() to keep only the words that are not in the stop word list.

# remove stop words from the tokenized tweets
tweets_tokenized_clean <- tweets_tokenized %>%
  anti_join(my_stop_words)
## Joining, by = "word"
# inspect data
tweets_tokenized_clean %>%
  head()
## # A tibble: 6 x 9
##   user_id status_id created_at          screen_name source display_text_wi…
##   <chr>   <chr>     <dttm>              <chr>       <chr>             <dbl>
## 1 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## 2 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## 3 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## 4 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## 5 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## 6 138203… 13229453… 2020-11-01 16:55:26 AOC         Twitt…               49
## # … with 3 more variables: favorite_count <int>, retweet_count <int>,
## #   word <chr>
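
The "Joining, by" message just tells us which column anti_join() used. If you prefer, you can make the joining column explicit; this is equivalent to the call above:

# the same anti-join with the joining column spelled out
tweets_tokenized_clean <- tweets_tokenized %>%
  anti_join(my_stop_words, by = "word")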

8 Most Frequent Words per Sub-corpus

We also use count() to count the frequency of individual words per screen_name.

# arrange count so we see most frequent words first
tweets_tokenized_clean %>%
  count(word, screen_name) %>%
  arrange(-n) %>%
  head()
## # A tibble: 6 x 3
##   word  screen_name       n
##   <chr> <chr>         <int>
## 1 https SpeakerPelosi  2234
## 2 t.co  SpeakerPelosi  2234
## 3 t.co  AOC            1261
## 4 https AOC            1255
## 5 amp   SpeakerPelosi  1190
## 6 amp   AOC             846

Hmmm… the most frequent words are related to URLs and other symbols. Let’s fix that.

We first need to create a list of tokens to remove.

tokens_to_remove <- c("https", "t.co", "amp", "et")

Now we can filter out (!) tokens in our list.

# remove unwanted tokens from the tokenized tweets
tweets_tokenized_clean <- tweets_tokenized_clean %>%
  filter(!(word %in% tokens_to_remove))

Let’s count the words again to see if the results look better.

# arrange count so we see most frequent words first
tweets_tokenized_clean %>%
  count(word, screen_name) %>%
  arrange(-n) %>%
  head()
## # A tibble: 6 x 3
##   word      screen_name       n
##   <chr>     <chr>         <int>
## 1 will      SpeakerPelosi   435
## 2 house     SpeakerPelosi   426
## 3 americans SpeakerPelosi   382
## 4 people    AOC             379
## 5 it’s      AOC             341
## 6 american  SpeakerPelosi   316
# looks good, create word_frequency_per_user data frame
word_frequency_per_user <- tweets_tokenized_clean %>%
  count(word, screen_name)

Plotting the data makes it easier to compare frequent tokens across different users.

word_frequency_per_user %>%
  group_by(screen_name) %>%
  top_n(20) %>%
  ggplot(aes(x = n, 
             y = reorder_within(word, n, screen_name))) +
  geom_col() +
  facet_wrap(~screen_name, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")
## Selecting by n
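
If you want to keep a copy of this figure, ggsave() saves the most recent plot to a file; a minimal sketch (the file name and dimensions are just examples):

# save the most recent ggplot to a file
ggsave("word_frequency_per_user.png", width = 8, height = 5)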

9 Normalized Frequencies

We already calculated the size of each sub-corpus, so we can use it to normalize the frequencies.

# inspect subcorpora_size again
subcorpora_size
## # A tibble: 2 x 2
##   screen_name       n
##   <chr>         <int>
## 1 AOC           66948
## 2 SpeakerPelosi 89003
# rename the column n to total
subcorpora_size <- subcorpora_size %>%
  rename(total = n)

# inspect subcorpora_size again
subcorpora_size
## # A tibble: 2 x 2
##   screen_name   total
##   <chr>         <int>
## 1 AOC           66948
## 2 SpeakerPelosi 89003

We can now join subcorpora_size with our word_frequency_per_user data frame by the column they have in common, which is screen_name.

word_frequency_per_user <- left_join(word_frequency_per_user,
                                     subcorpora_size)
## Joining, by = "screen_name"
# inspect data
word_frequency_per_user %>%
  head()
## # A tibble: 6 x 4
##   word       screen_name       n total
##   <chr>      <chr>         <int> <int>
## 1 _pamcampos AOC               1 66948
## 2 _vulvarine AOC               1 66948
## 3 ー         AOC               2 66948
## 4 ー         SpeakerPelosi     1 89003
## 5 0          AOC              12 66948
## 6 00         SpeakerPelosi     7 89003

We create a new column with the normalized frequency (occurrences per 10,000 words) using mutate(). For example, a word that appears 379 times in AOC’s 66,948-word sub-corpus has a normalized frequency of (379 / 66948) × 10,000 ≈ 56.6.

word_frequency_per_user <- word_frequency_per_user  %>%
  mutate(n_norm = (n/total)*10000)

# inspect data
word_frequency_per_user %>%
  head()
## # A tibble: 6 x 5
##   word       screen_name       n total n_norm
##   <chr>      <chr>         <int> <int>  <dbl>
## 1 _pamcampos AOC               1 66948  0.149
## 2 _vulvarine AOC               1 66948  0.149
## 3 ー         AOC               2 66948  0.299
## 4 ー         SpeakerPelosi     1 89003  0.112
## 5 0          AOC              12 66948  1.79 
## 6 00         SpeakerPelosi     7 89003  0.786

Plot the data again, this time by normalized frequency.

word_frequency_per_user %>%
  group_by(screen_name) %>%
  top_n(20) %>%
  ggplot(aes(x = n_norm, 
             y = reorder_within(word, n_norm, screen_name))) +
  geom_col() +
  facet_wrap(~screen_name, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")
## Selecting by n_norm

10 TF-IDF and Range

We can calculate term frequency–inverse document frequency (tf-idf) instead of normalized frequency. The goal of tf-idf is to decrease the weight of words that are common across all documents in a collection and increase the weight of words that are frequent in one document but rare in the others. Here, each user’s sub-corpus counts as one document.
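
Roughly speaking, bind_tf_idf() computes the following quantities, treating each screen_name as one document. This is only a sketch of the arithmetic for orientation; we do not need to run it:

# tf     = n / sum(n) within each screen_name (these totals come from the data
#          we pass to bind_tf_idf(), i.e. after stop word removal, so they are
#          not the same as the total column we created earlier)
# idf    = log(number of documents / number of documents containing the word)
# tf_idf = tf * idf
#
# with only two users, idf is log(2/1) ≈ 0.693 for words used by a single user
# and log(2/2) = 0 for words used by both, so shared words get a tf-idf of 0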

# calculate tf-idf based on n, providing the word column and the category column
word_tf_idf <- word_frequency_per_user %>%
  bind_tf_idf(word, screen_name, n)

# inspect data
word_tf_idf %>%
  head()
## # A tibble: 6 x 8
##   word       screen_name       n total n_norm        tf   idf    tf_idf
##   <chr>      <chr>         <int> <int>  <dbl>     <dbl> <dbl>     <dbl>
## 1 _pamcampos AOC               1 66948  0.149 0.0000261 0.693 0.0000181
## 2 _vulvarine AOC               1 66948  0.149 0.0000261 0.693 0.0000181
## 3 ー         AOC               2 66948  0.299 0.0000521 0     0        
## 4 ー         SpeakerPelosi     1 89003  0.112 0.0000198 0     0        
## 5 0          AOC              12 66948  1.79  0.000313  0.693 0.000217 
## 6 00         SpeakerPelosi     7 89003  0.786 0.000139  0.693 0.0000961

We can also add range (the number of different tweets a word appears in) to help decide which words to keep and to understand tf-idf a little better.

# calculate range per word (status_id indicates individual tweets)
word_range <- tweets_tokenized_clean %>%
  distinct(word, status_id) %>%
  count(word) %>%
  rename(range = n)

# add range to data frame with left_join
word_tf_idf <- left_join(word_tf_idf, word_range)
## Joining, by = "word"
# inspect data
word_tf_idf %>%
  head()
## # A tibble: 6 x 9
##   word       screen_name       n total n_norm        tf   idf    tf_idf range
##   <chr>      <chr>         <int> <int>  <dbl>     <dbl> <dbl>     <dbl> <int>
## 1 _pamcampos AOC               1 66948  0.149 0.0000261 0.693 0.0000181     1
## 2 _vulvarine AOC               1 66948  0.149 0.0000261 0.693 0.0000181     1
## 3 ー         AOC               2 66948  0.299 0.0000521 0     0             3
## 4 ー         SpeakerPelosi     1 89003  0.112 0.0000198 0     0             3
## 5 0          AOC              12 66948  1.79  0.000313  0.693 0.000217     10
## 6 00         SpeakerPelosi     7 89003  0.786 0.000139  0.693 0.0000961     7
# what's the mean range?
mean(word_tf_idf$range)
## [1] 8.025293

Plot the data again, this time by tf-idf, keeping only words with a range greater than 5.

word_tf_idf %>%
  filter(range > 5) %>%
  group_by(screen_name) %>%
  top_n(n = 10, wt = tf_idf) %>%
  ggplot(aes(x = tf_idf, 
             y = reorder_within(word, tf_idf, screen_name))) +
  geom_col() +
  facet_wrap(~screen_name, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")