Make sure you install both R and RStudio for this workshop.
Download and install R from https://cran.r-project.org (if you are a Windows user, first determine whether you are running the 32-bit or the 64-bit version)
Download and install RStudio from https://rstudio.com/products/rstudio/download/#download
If you already have R and RStudio installed on your computer, make sure your R version is greater than 4.0 by entering sessionInfo() in your console.
sessionInfo()
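If you prefer a quicker programmatic check, base R's getRversion() can be compared directly against a version string; this is just an optional alternative to reading the sessionInfo() output.
# optional alternative check: should print TRUE if your R version is at least 4.0
getRversion() >= "4.0"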
I will be providing you with 50 text files from MICUSP, from the English discipline subset. I’d like to thank Dr. Ute Römer for allowing us to use these files for this workshop. For more information on MICUSP, check the MICUSP references at the end of this page.
These text files have been annotated with the Stanford Dependency Parser and are thus in tab-separated format (CoNLL-U format). If you need instructions on how to tag your text files with the Stanford Dependency Parser, check these workshop materials by Larissa Goulart.
This data is available for the purposes of this workshop only.
For this workshop, we will be using the tidyverse package. Make sure you install it before proceeding.
install.packages("tidyverse")
Once the package is installed, you can load it with library().
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The easiest way to read multiple files in R and combine them all into the same data frame is with a for loop.
We first create a list of the files we want to read in with list.files().
# create list of files to read in
files <- list.files(path = "raw_corpus",
                    pattern = "*.txt.conll",
                    full.names = TRUE)
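A quick, optional sanity check is to look at how many files were found and what the first few paths look like.
# optional check: number of files found and the first few paths
length(files)
head(files)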
We then create an empty data frame with annotated_files <- data.frame(), which we bind_rows() with each file inside our for loop.
# read in all files
# first create empty data frame
annotated_files <- data.frame()
# for each file in our list of files
for (i in 1:length(files)){
  # read the tab separated file in
  this_file <- read_tsv(files[i], col_names = FALSE)
  # create a column with the filename, do some clean up of the file name
  this_file$filename <- gsub("\\.txt\\.conll|raw_corpus\\/", "", files[i])
  # combine this file with all the others
  annotated_files <- bind_rows(annotated_files,
                               this_file)
}
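Side note: since the tidyverse is already loaded, the same data frame can be built without growing an object inside a loop, using purrr::map_dfr(). This is only an alternative sketch (the filename column ends up first instead of last); the for loop above works just as well.
# alternative sketch with purrr: read each file and row-bind the results,
# storing the cleaned-up file name in a "filename" column
annotated_files <- files %>%
  set_names(gsub("\\.txt\\.conll|raw_corpus\\/", "", files)) %>%
  map_dfr(read_tsv, col_names = FALSE, .id = "filename")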
These annotated files do not have a header, so the column names are automatically assigned by R as X1, X2, and so on. We can change the column names using rename().
# change column names
annotated_files <- annotated_files %>%
  rename(token_number = X1,
         token = X2,
         lemma = X3,
         pos = X4,
         entity = X5,
         dependency = X6,
         dep_label = X7)
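To double-check that the renaming worked, glimpse() prints each column name along with its type and the first few values.
# optional check: column names, types, and a preview of the values
glimpse(annotated_files)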
Right now we only have information about each individual token. Adding information about which tokens belong to the same sentence allows us to retrieve the whole sentence in a search. The first step is to add a sentence number. Again we are going to use a for loop and a sentence counter (i.e., sentence_count) that increases by 1 every time a new sentence starts (i.e., whenever the token number is equal to 1).
# add sentence number
# this for loop takes a while to run
# start with creating a column for sentence number
annotated_files$sentence_number <- NA
# start the sentence counter
sentence_count <- 0
# for every row/token in our corpus
for (i in 1:nrow(annotated_files)) {
  # if the token number is one, that indicates a new sentence start
  if (annotated_files$token_number[i] == 1) {
    # add to sentence counter
    sentence_count <- sentence_count + 1
  }
  # add sentence_count to the appropriate column and row (i.e., row i)
  annotated_files$sentence_number[i] <- sentence_count
}
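The same column can be created without a loop: a new sentence starts exactly where the token number equals 1, so a cumulative sum over that condition gives the sentence numbers directly and runs much faster. Here is a vectorized alternative that should produce the same result.
# vectorized alternative: cumsum() goes up by 1 at every row where
# token_number is 1, i.e., at every sentence start
annotated_files$sentence_number <- cumsum(annotated_files$token_number == 1)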
Now that we know which tokens belong to the same sentence, we can create a new column in our data with all the tokens of each sentence collapsed together.
# now that we have sentence number, collapse tokens by sentence in a new
# sentence column/variable
annotated_files <- annotated_files %>%
  group_by(sentence_number) %>%
  mutate(sentence = paste(token, collapse = " ")) %>%
  ungroup()
Always ungroup() when creating a new object that results from a group_by(); this will save you some trouble in the future.
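Here is a small illustration of why: while the data are grouped, functions like n() are computed within each group, and only after ungroup() do they refer to the whole data frame.
# illustration: n() counts tokens per sentence while grouped,
# and tokens in the whole corpus after ungroup()
annotated_files %>%
  group_by(sentence_number) %>%
  mutate(tokens_in_sentence = n()) %>%
  ungroup() %>%
  mutate(tokens_in_corpus = n())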
We are done processing our corpus. To avoid having to rerun all of these steps every time you want to analyze or search the corpus, save the data frame to disk.
# write file out so we don't have to run the whole thing again
write_csv(annotated_files, "processed_corpus/micusp_engl_subset.csv")
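If you want to preserve the exact column types, an alternative is to save an R-native file with readr's write_rds() (and read it back later with read_rds()); the CSV above is all we need for this workshop.
# optional alternative: an .rds file keeps column types exactly as they are
write_rds(annotated_files, "processed_corpus/micusp_engl_subset.rds")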
Here’s the entire 01-read-corpus-files.R code:
# load libraries
library(tidyverse)
# create list of files to read in
files <- list.files(path = "raw_corpus",
                    pattern = "*.txt.conll",
                    full.names = TRUE)
# read in all files
# first create empty data frame
annotated_files <- data.frame()
# for each file in our list of files
for (i in 1:length(files)){
  # read the tab separated file in
  this_file <- read_tsv(files[i], col_names = FALSE)
  # create a column with the filename, do some clean up of the file name
  this_file$filename <- gsub("\\.txt\\.conll|raw_corpus\\/", "", files[i])
  # combine this file with all the others
  annotated_files <- bind_rows(annotated_files,
                               this_file)
}
# change column names
annotated_files <- annotated_files %>%
  rename(token_number = X1,
         token = X2,
         lemma = X3,
         pos = X4,
         entity = X5,
         dependency = X6,
         dep_label = X7)
# add sentence number
# this for loop takes a while to run
# start with creating a column for sentence number
annotated_files$sentence_number <- NA
# start the sentence counter
sentence_count <- 0
# for every row/token in our corpus
for (i in 1:nrow(annotated_files)) {
  # if the token number is one, that indicates a new sentence start
  if (annotated_files$token_number[i] == 1) {
    # add to sentence counter
    sentence_count <- sentence_count + 1
  }
  # add sentence_count to the appropriate column and row (i.e., row i)
  annotated_files$sentence_number[i] <- sentence_count
}
# now that we have sentence number, collapse tokens by sentence in a new
# sentence column/variable
annotated_files <- annotated_files %>%
  group_by(sentence_number) %>%
  mutate(sentence = paste(token, collapse = " ")) %>%
  ungroup()
# write file out so we don't have to run the whole thing again
write_csv(annotated_files, "processed_corpus/micusp_engl_subset.csv")
I start my second R script by loading the libraries and the data.
# load library
library(tidyverse)
# read data in
annotated_files <- read_csv("processed_corpus/micusp_engl_subset.csv")
Let’s start with the simplest way to search our corpus.
# set search expression to desired string
search_expression <- "argue"
# simplest way, just match the lemma with the search_expression
annotated_files %>%
filter(lemma == search_expression) %>%
select(sentence_number, sentence)
## # A tibble: 17 x 2
## sentence_number sentence
## <dbl> <chr>
## 1 178 One could argue that the United States is already concerned …
## 2 966 Levi argues that the reaction of those in the Lager , while …
## 3 1487 I will argue that it is this fractured identity combined wit…
## 4 1491 McCullough argues that for the Casas family , Cuban exile re…
## 5 1511 Obejas argues that `` culturally we 're defined by our famili…
## 6 1607 However , I would argue that McCullough 's assertion should …
## 7 1788 Jesus is seemingly arguing that men and women are equal in t…
## 8 1833 In fact , it is arguable that the Gospel of Luke includes mo…
## 9 1986 The invention of the printing press and the rise of literary…
## 10 1993 As Zipes argues , however , industrialization itself was not…
## 11 2001 Zipes argues that technology was the fuel for Walt Disney 's…
## 12 2015 Jack Zipes argues that fairy tales should be accessible to e…
## 13 2462 Stephen then goes on to refer to Arius , the first century h…
## 14 3734 In another vain , it could be argued that these authors are …
## 15 3741 This point must be made , because some may argue that Ondaat…
## 16 3775 Assuming that a very small portion of the books audience spe…
## 17 3989 While some may argue that the use of graphics is a way to du…
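The same filter can feed other summaries. For instance, to count how many matching tokens each file contributes:
# count matches per file instead of listing the sentences
annotated_files %>%
  filter(lemma == search_expression) %>%
  count(filename, sort = TRUE)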
In this workshop, we will use two functions that make use of regular expressions: grepl() and gsub(). In your console, enter ?grepl() and then ?gsub() to read the help pages for these functions.
The function grepl() returns TRUE when it finds a match and FALSE when it doesn’t, which makes it a good function to use for filtering data. Here are a few simple examples of grepl() in use.
# the pattern "argue" can be found in "argues"
grepl("argue", "argues")
## [1] TRUE
# the pattern "argue" is not found in "arguing"
grepl("argue", "arguing")
## [1] FALSE
# use period to mean any character
grepl("argu.", "arguing")
## [1] TRUE
# period means ANY character
grepl("argu.", c("argues", "arguing", "argu7"))
## [1] TRUE TRUE TRUE
# you can use [a-z] to mean any lower-case letter
grepl("argu[a-z]", c("argues", "arguing", "argu7"))
## [1] TRUE TRUE FALSE
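Two more bits of regular expression syntax come up often in corpus searches: ^ anchors a pattern to the start of the string and $ anchors it to the end. A couple of extra examples (not part of the set above):
# ^ anchors the pattern to the start of the string
grepl("^argu", c("argues", "counterargument"))
## [1]  TRUE FALSE
# $ anchors the pattern to the end of the string
grepl("ing$", c("arguing", "argues"))
## [1]  TRUE FALSE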
Here’s a Basic Regular Expressions in R Cheat Sheet.
The function gsub() replaces global (i.e., all) matches with a replacement string, which makes it a good function to use for changing data with mutate(). Here are a few simple examples using sub() and gsub().
# the pattern "es" is found in "argues" and is replaced with "ing"
sub("es", "ing", "argues")
## [1] "arguing"
# as with grepl, the string to find the pattern in can be a vector
sub("es", "ing", c("argues", "argue"))
## [1] "arguing" "argue"
# use ? to make a character optional
sub("es?", "ing", c("argues", "argue"))
## [1] "arguing" "arguing"
# sub replaces only the first match (compare with gsub below)
sub("es?", "ing", "He argues and I argue")
## [1] "Hing argues and I argue"
# use gsub to replace all matches, not just the first one
gsub("es?", "ing", "He argues and I argue")
## [1] "Hing arguing and I arguing"
# to match something and repeat that something in the replacement we can use
# grouping like \\1 and \\2 and so on. Groups are defined by parentheses.
# use + to mean one or more of something
gsub("([a-z]+)es?", "will be \\1ing", "He argues and I argue")
## [1] "He will be arguing and I will be arguing"
Going back to our original corpus search: instead of matching the exact same string, we can use the function grepl() to treat our search_expression as a regular expression.
# simplest way, just match the lemma with the search_expression
annotated_files %>%
filter(grepl(search_expression, lemma)) %>%
select(sentence_number, sentence)
## # A tibble: 17 x 2
## sentence_number sentence
## <dbl> <chr>
## 1 178 One could argue that the United States is already concerned …
## 2 966 Levi argues that the reaction of those in the Lager , while …
## 3 1487 I will argue that it is this fractured identity combined wit…
## 4 1491 McCullough argues that for the Casas family , Cuban exile re…
## 5 1511 Obejas argues that `` culturally we 're defined by our famili…
## 6 1607 However , I would argue that McCullough 's assertion should …
## 7 1788 Jesus is seemingly arguing that men and women are equal in t…
## 8 1833 In fact , it is arguable that the Gospel of Luke includes mo…
## 9 1986 The invention of the printing press and the rise of literary…
## 10 1993 As Zipes argues , however , industrialization itself was not…
## 11 2001 Zipes argues that technology was the fuel for Walt Disney 's…
## 12 2015 Jack Zipes argues that fairy tales should be accessible to e…
## 13 2462 Stephen then goes on to refer to Arius , the first century h…
## 14 3734 In another vain , it could be argued that these authors are …
## 15 3741 This point must be made , because some may argue that Ondaat…
## 16 3775 Assuming that a very small portion of the books audience spe…
## 17 3989 While some may argue that the use of graphics is a way to du…
We can highlight our search term in our results. For that, we need to change our search_expression slightly by adding parentheses around it, so we can refer to the match in our gsub() replacement with \\1. Let’s add two stars (i.e., **) just before our search expression.
# set regular expression for our search
search_expression <- "(argue)"
# simplest way
annotated_files %>%
filter(grepl(search_expression, lemma)) %>%
select(sentence_number, sentence) %>%
mutate(sentence = gsub(search_expression, "**\\1", sentence))
## # A tibble: 17 x 2
## sentence_number sentence
## <dbl> <chr>
## 1 178 One could **argue that the United States is already concerne…
## 2 966 Levi **argues that the reaction of those in the Lager , whil…
## 3 1487 I will **argue that it is this fractured identity combined w…
## 4 1491 McCullough **argues that for the Casas family , Cuban exile …
## 5 1511 Obejas **argues that `` culturally we 're defined by our fami…
## 6 1607 However , I would **argue that McCullough 's assertion shoul…
## 7 1788 Jesus is seemingly arguing that men and women are equal in t…
## 8 1833 In fact , it is arguable that the Gospel of Luke includes mo…
## 9 1986 The invention of the printing press and the rise of literary…
## 10 1993 As Zipes **argues , however , industrialization itself was n…
## 11 2001 Zipes **argues that technology was the fuel for Walt Disney …
## 12 2015 Jack Zipes **argues that fairy tales should be accessible to…
## 13 2462 Stephen then goes on to refer to Arius , the first century h…
## 14 3734 In another vain , it could be **argued that these authors ar…
## 15 3741 This point must be made , because some may **argue that Onda…
## 16 3775 Assuming that a very small portion of the books audience spe…
## 17 3989 While some may **argue that the use of graphics is a way to …
CHALLENGE:
Applying what we’ve seen about regular expressions, how can you change the search expression to highlight different patterns of use in the corpus?
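One possible direction (among many): instead of matching only the lemma argue, match any word that starts with argu, so that arguing, argued, and arguable get highlighted as well.
# one possible answer: highlight any word starting with "argu"
annotated_files %>%
  filter(grepl("(argu[a-z]+)", lemma)) %>%
  select(sentence_number, sentence) %>%
  mutate(sentence = gsub("(argu[a-z]+)", "**\\1", sentence))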
We can create a new variable in our data to indicate whether a lemma matches our search expression.
# kwic way
annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE))
token | kwic |
---|---|
argue | TRUE |
argues | TRUE |
argue | TRUE |
argues | TRUE |
argues | TRUE |
argue | TRUE |
We can then create the before and after context for each lemma.
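These before and after windows rely on dplyr’s lag() and lead(), which shift a vector backwards or forwards; here is a toy illustration with a made-up vector (not from the corpus).
# toy illustration of lag() and lead() (hypothetical vector)
tokens <- c("He", "argues", "that")
lag(tokens)   # NA "He" "argues"
lead(tokens)  # "argues" "that" NA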
# kwic way
annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE)) %>%
mutate(before = paste(lag(token, 3), lag(token, 2), lag(token)),
after = paste(lead(token), lead(token, 2), lead(token, 3)))
before | token | after |
---|---|---|
NA NA NA | As | Virginia Woof sees |
NA NA As | Virginia | Woof sees it |
NA As Virginia | Woof | sees it , |
As Virginia Woof | sees | it , the |
Virginia Woof sees | it | , the traditional |
Woof sees it | , | the traditional roles |
Let’s clean up the ugly NAs.
# kwic way
annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE)) %>%
mutate(before = gsub("NA\\s", "", paste(lag(token, 3), lag(token, 2), lag(token))),
after = gsub("NA\\s", "", paste(lead(token), lead(token, 2), lead(token, 3)))
)
before | token | after |
---|---|---|
 | As | Virginia Woof sees |
As | Virginia | Woof sees it |
As Virginia | Woof | sees it , |
As Virginia Woof | sees | it , the |
Virginia Woof sees | it | , the traditional |
Woof sees it | , | the traditional roles |
Finally, we filter for only the lemmas that match our search (i.e., kwic is TRUE) and select() only the variables before, token, and after, in that order.
# kwic way
annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE)) %>%
mutate(before = gsub("NA\\s?", "", paste(lag(token, 3), lag(token, 2), lag(token))),
after = gsub("NA\\s?", "", paste(lead(token), lead(token, 2), lead(token, 3)))
) %>%
filter(kwic) %>%
select(before, token, after) %>%
head() %>%
knitr::kable()  # kable() comes from the knitr package, which is not part of the tidyverse
before | token | after |
---|---|---|
. One could | argue | that the United |
follows ? Levi | argues | that the reaction |
. I will | argue | that it is |
impurity . McCullough | argues | that for the |
recognized . Obejas | argues | that `` culturally |
, I would | argue | that McCullough ’s |
Here’s the entire 02-corpus-searches.R code:
# load library
library(tidyverse)
# read data in
annotated_files <- read_csv("processed_corpus/micusp_engl_subset.csv")
# get concordance lines
search_expression <- "(argue)"
# simplest way
annotated_files %>%
filter(grepl(search_expression, lemma)) %>%
select(sentence_number, sentence) %>%
mutate(sentence = gsub(search_expression, "**\\1", sentence))
# kwic way
annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE)) %>%
mutate(before = gsub("NA\\s", "", paste(lag(token, 3), lag(token, 2), lag(token))),
after = gsub("NA\\s", "", paste(lead(token), lead(token, 2), lead(token, 3)))
) %>%
filter(kwic) %>%
select(before, token, after) %>%
view()
# another way, which gets the whole sentence
search_results <- annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE))
concordance_lines <- data.frame()
for (i in 1:nrow(search_results)) {
  if (search_results$kwic[i]) {
    # get sentence for this kwic
    selected_sentence_number <- search_results$sentence_number[i]
    selected_sentence <- search_results %>%
      filter(sentence_number == selected_sentence_number)
    kwic_token_number <- search_results$token_number[i]
    context_before <- selected_sentence %>%
      filter(token_number < kwic_token_number) %>%
      mutate(context_before = paste(token, collapse = " ")) %>%
      distinct(context_before) %>%
      pull(context_before)
    context_after <- selected_sentence %>%
      filter(token_number > kwic_token_number) %>%
      mutate(context_after = paste(token, collapse = " ")) %>%
      distinct(context_after) %>%
      pull(context_after)
    this_concordance_line <- data.frame(context_before = context_before,
                                        kwic = search_results$token[i],
                                        context_after = context_after)
    concordance_lines <- bind_rows(concordance_lines,
                                   this_concordance_line)
  }
}
############# INCLUDE PART OF SPEECH IN SEARCH #############################
# read pos_description data
pos_descriptions <- read_csv("auxiliary_data/pos_descriptions.csv")
# add pos_description to data
annotated_files <- left_join(annotated_files,
pos_descriptions)
# simplest way -- with lemma and pos
annotated_files %>%
filter(grepl(search_expression, lemma)) %>%
filter(grepl("^V", pos)) %>%
select(pos, pos_description, lemma, token, sentence_number, sentence) %>%
mutate(sentence = gsub(search_expression, "**\\1", sentence)) %>%
view()
# simplest way -- with just pos
annotated_files %>%
filter(grepl("^V", pos)) %>%
select(pos, pos_description, lemma, token, sentence_number, sentence) %>%
mutate(sentence = gsub(search_expression, "**\\1", sentence)) %>%
view()
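A possible follow-up once the part-of-speech descriptions are joined in is to count how often each verb tag occurs. This is just a sketch, assuming the left_join above has been run.
# sketch: frequency of each verb tag and its description
annotated_files %>%
  filter(grepl("^V", pos)) %>%
  count(pos, pos_description, sort = TRUE)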
Michigan Corpus of Upper-level Student Papers. 2009. Ann Arbor, MI: The Regents of the University of Michigan.
Römer, Ute & Matthew B. O’Donnell. 2011. From student hard drive to web corpus (part 1): The design, compilation and genre classification of the Michigan Corpus of Upper-level Student Papers (MICUSP). Corpora 6(2): 159-177. http://uteroemer.weebly.com/uploads/5/5/7/7/5577406/roemer_and_odonnell_corpora_article_2011.pdf
O’Donnell, Matthew B. & Römer, Ute. 2012. From student hard drive to web corpus (part 2): The annotation and online distribution of the Michigan Corpus of Upper-level Student Papers (MICUSP). Corpora 7(1): 1-18. http://uteroemer.weebly.com/uploads/5/5/7/7/5577406/odonnell_and_roemer_corpora_article_2012.pdf
Please take some time to fill out this workshop’s exit survey.