Make sure you install both R and RStudio for this workshop.
Download and install R from https://cran.r-project.org (if you are a Windows user, first determine whether you are running the 32-bit or the 64-bit version)
Download and install RStudio from https://rstudio.com/products/rstudio/download/#download
If you already have R and RStudio installed on your computer, make sure your R version is greater than 4.0 by entering sessionInfo() in your console.
sessionInfo()
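If you prefer a quicker programmatic check, base R's getRversion() can be compared directly against a version string; this is just an optional alternative to reading the sessionInfo() output.
# optional alternative check: should print TRUE if your R version is at least 4.0
getRversion() >= "4.0"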
I will be providing you with 50 text files from MICUSP, from the English discipline subset. I’d like to thank Dr. Ute Römer for allowing us to use these files for this workshop. For more information on MICUSP, check the MICUSP references at the end of this page.
These text files have been annotated with the Stanford Dependency Parser and are thus in tab-separated format (CoNLL-U format). If you need instructions on how to tag your text files with the Stanford Dependency Parser, check these workshop materials by Larissa Goulart.
This data is available for the purposes of this workshop only.
For this workshop, we will be using the tidyverse package. Make sure you install it before proceeding.
install.packages("tidyverse")
Once the package is installed, you can load it with library().
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The easiest way to read multiple files in R and combine them all into the same data frame is with a for loop.
We first create a list of the files we want to read in with list.files().
# create list of files to read in
files <- list.files(path = "raw_corpus",
                    pattern = "*.txt.conll",
                    full.names = TRUE)
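A quick, optional sanity check is to look at how many files were found and what the first few paths look like.
# optional check: number of files found and the first few paths
length(files)
head(files)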
We then create an empty data frame with annotated_files <- data.frame(), which we bind_rows() with each file inside our for loop.
# read in all files
# first create empty data frame
annotated_files <- data.frame()
# for each file in our list of files
for (i in 1:length(files)){
  # read the tab separated file in
  this_file <- read_tsv(files[i], col_names = FALSE)
  # create a column with the filename, do some clean up of the file name
  this_file$filename <- gsub("\\.txt\\.conll|raw_corpus\\/", "", files[i])
  # combine this file with all the others
  annotated_files <- bind_rows(annotated_files,
                               this_file)
}
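Side note: since the tidyverse is already loaded, the same data frame can be built without growing an object inside a loop, using purrr::map_dfr(). This is only an alternative sketch (the filename column ends up first instead of last); the for loop above works just as well.
# alternative sketch with purrr: read each file and row-bind the results,
# storing the cleaned-up file name in a "filename" column
annotated_files <- files %>%
  set_names(gsub("\\.txt\\.conll|raw_corpus\\/", "", files)) %>%
  map_dfr(read_tsv, col_names = FALSE, .id = "filename")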
These annotated files do not have a header, so the column names are automatically assigned by R as X1, X2, and so on. We can change the column names using rename().
# change column names
annotated_files <- annotated_files %>%
  rename(token_number = X1,
         token = X2,
         lemma = X3,
         pos = X4,
         entity = X5,
         dependency = X6,
         dep_label = X7)
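To double-check that the renaming worked, glimpse() prints each column name along with its type and the first few values.
# optional check: column names, types, and a preview of the values
glimpse(annotated_files)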
Right now we only have information about each individual token. Adding information about which tokens belong to the same sentence allows us to retrieve the whole sentence in a search. The first step is to add a sentence number. Again we are going to use a for loop and a sentence counter (i.e., sentence_count) that increases by 1 every time a new sentence starts (i.e., whenever the token number is equal to 1).
# add sentence number
# this for loop takes a while to run
# start with creating a column for sentence number
annotated_files$sentence_number <- NA
# start the sentence counter
sentence_count <- 0
# for every row/token in our corpus
for (i in 1:nrow(annotated_files)) {
  # if the token number is one, that indicates a new sentence start
  if (annotated_files$token_number[i] == 1) {
    # add to sentence counter
    sentence_count <- sentence_count + 1
  }
  # add sentence_count to the appropriate column and row (i.e., row i)
  annotated_files$sentence_number[i] <- sentence_count
}
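The same column can be created without a loop: a new sentence starts exactly where the token number equals 1, so a cumulative sum over that condition gives the sentence numbers directly and runs much faster. Here is a vectorized alternative that should produce the same result.
# vectorized alternative: cumsum() goes up by 1 at every row where
# token_number is 1, i.e., at every sentence start
annotated_files$sentence_number <- cumsum(annotated_files$token_number == 1)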
Now that we know which tokens belong to the same sentence, we can create a new column in our data with all the tokens of each sentence collapsed together.
# now that we have sentence number, collapse tokens by sentence in a new
# sentence column/variable
annotated_files <- annotated_files %>%
  group_by(sentence_number) %>%
  mutate(sentence = paste(token, collapse = " ")) %>%
  ungroup()
Always ungroup() when creating a new object that results from a group_by(); this will save you some trouble in the future.
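Here is a small illustration of why: while the data are grouped, functions like n() are computed within each group, and only after ungroup() do they refer to the whole data frame.
# illustration: n() counts tokens per sentence while grouped,
# and tokens in the whole corpus after ungroup()
annotated_files %>%
  group_by(sentence_number) %>%
  mutate(tokens_in_sentence = n()) %>%
  ungroup() %>%
  mutate(tokens_in_corpus = n())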
We are done processing our corpus. To avoid having to rerun all of these steps every time you want to analyze or search the corpus, save the data frame to disk.
# write file out so we don't have to run the whole thing again
write_csv(annotated_files, "processed_corpus/micusp_engl_subset.csv")
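If you want to preserve the exact column types, an alternative is to save an R-native file with readr's write_rds() (and read it back later with read_rds()); the CSV above is all we need for this workshop.
# optional alternative: an .rds file keeps column types exactly as they are
write_rds(annotated_files, "processed_corpus/micusp_engl_subset.rds")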
Here’s the entire 01-read-corpus-files.R code:
# load libraries
library(tidyverse)
# create list of files to read in
files <- list.files(path = "raw_corpus",
                    pattern = "*.txt.conll",
                    full.names = TRUE)
# read in all files
# first create empty data frame
annotated_files <- data.frame()
# for each file in our list of files
for (i in 1:length(files)){
  # read the tab separated file in
  this_file <- read_tsv(files[i], col_names = FALSE)
  # create a column with the filename, do some clean up of the file name
  this_file$filename <- gsub("\\.txt\\.conll|raw_corpus\\/", "", files[i])
  # combine this file with all the others
  annotated_files <- bind_rows(annotated_files,
                               this_file)
}
# change column names
annotated_files <- annotated_files %>%
  rename(token_number = X1,
         token = X2,
         lemma = X3,
         pos = X4,
         entity = X5,
         dependency = X6,
         dep_label = X7)
# add sentence number
# this for loop takes a while to run
# start with creating a column for sentence number
annotated_files$sentence_number <- NA
# start the sentence counter
sentence_count <- 0
# for every row/token in our corpus
for (i in 1:nrow(annotated_files)) {
  # if the token number is one, that indicates a new sentence start
  if (annotated_files$token_number[i] == 1) {
    # add to sentence counter
    sentence_count <- sentence_count + 1
  }
  # add sentence_count to the appropriate column and row (i.e., row i)
  annotated_files$sentence_number[i] <- sentence_count
}
# now that we have sentence number, collapse tokens by sentence in a new
# sentence column/variable
annotated_files <- annotated_files %>%
  group_by(sentence_number) %>%
  mutate(sentence = paste(token, collapse = " ")) %>%
  ungroup()
# write file out so we don't have to run the whole thing again
write_csv(annotated_files, "processed_corpus/micusp_engl_subset.csv")
I start my second R script by loading the libraries and the data.
# load library
library(tidyverse)
# read data in
annotated_files <- read_csv("processed_corpus/micusp_engl_subset.csv")
Let’s start with the simplest way to search our corpus.
# set search expression to desired string
search_expression <- "argue"
# simplest way, just match the lemma with the search_expression
annotated_files %>%
filter(lemma == search_expression) %>%
select(sentence_number, sentence)
## # A tibble: 17 x 2
## sentence_number sentence
## <dbl> <chr>
## 1 178 One could argue that the United States is already concerned …
## 2 966 Levi argues that the reaction of those in the Lager , while …
## 3 1487 I will argue that it is this fractured identity combined wit…
## 4 1491 McCullough argues that for the Casas family , Cuban exile re…
## 5 1511 Obejas argues that `` culturally we 're defined by our famili…
## 6 1607 However , I would argue that McCullough 's assertion should …
## 7 1788 Jesus is seemingly arguing that men and women are equal in t…
## 8 1833 In fact , it is arguable that the Gospel of Luke includes mo…
## 9 1986 The invention of the printing press and the rise of literary…
## 10 1993 As Zipes argues , however , industrialization itself was not…
## 11 2001 Zipes argues that technology was the fuel for Walt Disney 's…
## 12 2015 Jack Zipes argues that fairy tales should be accessible to e…
## 13 2462 Stephen then goes on to refer to Arius , the first century h…
## 14 3734 In another vain , it could be argued that these authors are …
## 15 3741 This point must be made , because some may argue that Ondaat…
## 16 3775 Assuming that a very small portion of the books audience spe…
## 17 3989 While some may argue that the use of graphics is a way to du…
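The same filter can feed other summaries. For instance, to count how many matching tokens each file contributes:
# count matches per file instead of listing the sentences
annotated_files %>%
  filter(lemma == search_expression) %>%
  count(filename, sort = TRUE)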
In this workshop, we will use two functions that make use of regular expressions: grepl() and gsub(). In your console, enter ?grepl() and then ?gsub() to read the help pages for these functions.
The function grepl() returns TRUE when it finds a match and FALSE when it doesn’t, which makes it a good function to use for filtering data. Here are a few simple examples of grepl() in use.
# the pattern "argue" can be found in "argues"
grepl("argue", "argues")
## [1] TRUE
# the pattern "argue" is not found in "arguing"
grepl("argue", "arguing")
## [1] FALSE
# use period to mean any character
grepl("argu.", "arguing")
## [1] TRUE
# period means ANY character
grepl("argu.", c("argues", "arguing", "argu7"))
## [1] TRUE TRUE TRUE
# you can use [a-z] to mean any lower-case letter
grepl("argu[a-z]", c("argues", "arguing", "argu7"))
## [1] TRUE TRUE FALSE
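Two more bits of regular expression syntax come up often in corpus searches: ^ anchors a pattern to the start of the string and $ anchors it to the end. A couple of extra examples (not part of the set above):
# ^ anchors the pattern to the start of the string
grepl("^argu", c("argues", "counterargument"))
## [1]  TRUE FALSE
# $ anchors the pattern to the end of the string
grepl("ing$", c("arguing", "argues"))
## [1]  TRUE FALSE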
Here’s a Basic Regular Expressions in R Cheat Sheet.
The function gsub() replaces global (i.e., all) matches with a replacement string, which makes it a good function to use for changing data with mutate(). Here are a few simple examples using sub() and gsub().
# the pattern "es" is found in "argues" and is replaced with "ing"
sub("es", "ing", "argues")
## [1] "arguing"
# as with grepl, the string to find the pattern in can be a vector
sub("es", "ing", c("argues", "argue"))
## [1] "arguing" "argue"
# use ? to make a character optional
sub("es?", "ing", c("argues", "argue"))
## [1] "arguing" "arguing"
# sub replaces only the first match (compare with gsub below)
sub("es?", "ing", "He argues and I argue")
## [1] "Hing argues and I argue"
# use gsub to replace all matches, not just the first one
gsub("es?", "ing", "He argues and I argue")
## [1] "Hing arguing and I arguing"
# to match something and repeat that something in the replacement we can use
# grouping like \\1 and \\2 and so on. Groups are defined by parentheses.
# use + to mean one or more of something
gsub("([a-z]+)es?", "will be \\1ing", "He argues and I argue")
## [1] "He will be arguing and I will be arguing"
Going back to our original corpus search: instead of matching the exact same string, we can use the function grepl() to treat our search_expression as a regular expression.
# simplest way, just match the lemma with the search_expression
annotated_files %>%
filter(grepl(search_expression, lemma)) %>%
select(sentence_number, sentence)
## # A tibble: 17 x 2
## sentence_number sentence
## <dbl> <chr>
## 1 178 One could argue that the United States is already concerned …
## 2 966 Levi argues that the reaction of those in the Lager , while …
## 3 1487 I will argue that it is this fractured identity combined wit…
## 4 1491 McCullough argues that for the Casas family , Cuban exile re…
## 5 1511 Obejas argues that `` culturally we 're defined by our famili…
## 6 1607 However , I would argue that McCullough 's assertion should …
## 7 1788 Jesus is seemingly arguing that men and women are equal in t…
## 8 1833 In fact , it is arguable that the Gospel of Luke includes mo…
## 9 1986 The invention of the printing press and the rise of literary…
## 10 1993 As Zipes argues , however , industrialization itself was not…
## 11 2001 Zipes argues that technology was the fuel for Walt Disney 's…
## 12 2015 Jack Zipes argues that fairy tales should be accessible to e…
## 13 2462 Stephen then goes on to refer to Arius , the first century h…
## 14 3734 In another vain , it could be argued that these authors are …
## 15 3741 This point must be made , because some may argue that Ondaat…
## 16 3775 Assuming that a very small portion of the books audience spe…
## 17 3989 While some may argue that the use of graphics is a way to du…
We can highlight our search term in our results. For that, we need to change our search_expression slightly by adding parentheses around it, so we can refer to the match in our gsub() replacement with \\1. Let’s add two stars (i.e., **) just before our search expression.
# set regular expression for our search
search_expression <- "(argue)"
# simplest way
annotated_files %>%
filter(grepl(search_expression, lemma)) %>%
select(sentence_number, sentence) %>%
mutate(sentence = gsub(search_expression, "**\\1", sentence))
## # A tibble: 17 x 2
## sentence_number sentence
## <dbl> <chr>
## 1 178 One could **argue that the United States is already concerne…
## 2 966 Levi **argues that the reaction of those in the Lager , whil…
## 3 1487 I will **argue that it is this fractured identity combined w…
## 4 1491 McCullough **argues that for the Casas family , Cuban exile …
## 5 1511 Obejas **argues that `` culturally we 're defined by our fami…
## 6 1607 However , I would **argue that McCullough 's assertion shoul…
## 7 1788 Jesus is seemingly arguing that men and women are equal in t…
## 8 1833 In fact , it is arguable that the Gospel of Luke includes mo…
## 9 1986 The invention of the printing press and the rise of literary…
## 10 1993 As Zipes **argues , however , industrialization itself was n…
## 11 2001 Zipes **argues that technology was the fuel for Walt Disney …
## 12 2015 Jack Zipes **argues that fairy tales should be accessible to…
## 13 2462 Stephen then goes on to refer to Arius , the first century h…
## 14 3734 In another vain , it could be **argued that these authors ar…
## 15 3741 This point must be made , because some may **argue that Onda…
## 16 3775 Assuming that a very small portion of the books audience spe…
## 17 3989 While some may **argue that the use of graphics is a way to …
CHALLENGE:
Applying what we’ve seen about regular expressions, how can you change the search expression to highlight different patterns of use in the corpus?
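One possible direction (among many): instead of matching only the lemma argue, match any word that starts with argu, so that arguing, argued, and arguable get highlighted as well.
# one possible answer: highlight any word starting with "argu"
annotated_files %>%
  filter(grepl("(argu[a-z]+)", lemma)) %>%
  select(sentence_number, sentence) %>%
  mutate(sentence = gsub("(argu[a-z]+)", "**\\1", sentence))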
We can create a new variable in our data to indicate whether a lemma matches our search expression.
# kwic way
annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE))
token | kwic |
---|---|
argue | TRUE |
argues | TRUE |
argue | TRUE |
argues | TRUE |
argues | TRUE |
argue | TRUE |
We can then create the before and after context for each lemma.
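These before and after windows rely on dplyr’s lag() and lead(), which shift a vector backwards or forwards; here is a toy illustration with a made-up vector (not from the corpus).
# toy illustration of lag() and lead() (hypothetical vector)
tokens <- c("He", "argues", "that")
lag(tokens)   # NA "He" "argues"
lead(tokens)  # "argues" "that" NA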
# kwic way
annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE)) %>%
mutate(before = paste(lag(token, 3), lag(token, 2), lag(token)),
after = paste(lead(token), lead(token, 2), lead(token, 3)))
before | token | after |
---|---|---|
NA NA NA | As | Virginia Woof sees |
NA NA As | Virginia | Woof sees it |
NA As Virginia | Woof | sees it , |
As Virginia Woof | sees | it , the |
Virginia Woof sees | it | , the traditional |
Woof sees it | , | the traditional roles |
Let’s clean up the ugly NAs.
# kwic way
annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE)) %>%
mutate(before = gsub("NA\\s", "", paste(lag(token, 3), lag(token, 2), lag(token))),
after = gsub("NA\\s", "", paste(lead(token), lead(token, 2), lead(token, 3)))
)
before | token | after |
---|---|---|
 | As | Virginia Woof sees |
As | Virginia | Woof sees it |
As Virginia | Woof | sees it , |
As Virginia Woof | sees | it , the |
Virginia Woof sees | it | , the traditional |
Woof sees it | , | the traditional roles |
Finally, we filter for only the lemmas that match our search (i.e., kwic is TRUE) and select() only the variables before, token, and after, in that order.
# kwic way
annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE)) %>%
mutate(before = gsub("NA\\s?", "", paste(lag(token, 3), lag(token, 2), lag(token))),
after = gsub("NA\\s?", "", paste(lead(token), lead(token, 2), lead(token, 3)))
) %>%
filter(kwic) %>%
select(before, token, after) %>%
head() %>%
knitr::kable()  # kable() comes from the knitr package, which is not part of the tidyverse
before | token | after |
---|---|---|
. One could | argue | that the United |
follows ? Levi | argues | that the reaction |
. I will | argue | that it is |
impurity . McCullough | argues | that for the |
recognized . Obejas | argues | that `` culturally |
, I would | argue | that McCullough ’s |
Here’s the entire 02-corpus-searches.R code:
# load library
library(tidyverse)
# read data in
annotated_files <- read_csv("processed_corpus/micusp_engl_subset.csv")
# get concordance lines
search_expression <- "(argue)"
# simplest way
annotated_files %>%
filter(grepl(search_expression, lemma)) %>%
select(sentence_number, sentence) %>%
mutate(sentence = gsub(search_expression, "**\\1", sentence))
# kwic way
annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE)) %>%
mutate(before = gsub("NA\\s", "", paste(lag(token, 3), lag(token, 2), lag(token))),
after = gsub("NA\\s", "", paste(lead(token), lead(token, 2), lead(token, 3)))
) %>%
filter(kwic) %>%
select(before, token, after) %>%
view()
# another way, which gets the whole sentence
search_results <- annotated_files %>%
mutate(kwic = ifelse(grepl(search_expression, lemma),
TRUE, FALSE))
concordance_lines <- data.frame()
for (i in 1:nrow(search_results)) {
  if (search_results$kwic[i]) {
    # get sentence for this kwic
    selected_sentence_number <- search_results$sentence_number[i]
    selected_sentence <- search_results %>%
      filter(sentence_number == selected_sentence_number)
    kwic_token_number <- search_results$token_number[i]
    context_before <- selected_sentence %>%
      filter(token_number < kwic_token_number) %>%
      mutate(context_before = paste(token, collapse = " ")) %>%
      distinct(context_before) %>%
      pull(context_before)
    context_after <- selected_sentence %>%
      filter(token_number > kwic_token_number) %>%
      mutate(context_after = paste(token, collapse = " ")) %>%
      distinct(context_after) %>%
      pull(context_after)
    this_concordance_line <- data.frame(context_before = context_before,
                                        kwic = search_results$token[i],
                                        context_after = context_after)
    concordance_lines <- bind_rows(concordance_lines,
                                   this_concordance_line)
  }
}
############# INCLUDE PART OF SPEECH IN SEARCH #############################
# read pos_description data
pos_descriptions <- read_csv("auxiliary_data/pos_descriptions.csv")
# add pos_description to data
annotated_files <- left_join(annotated_files,
pos_descriptions)
# simplest way -- with lemma and pos
annotated_files %>%
filter(grepl(search_expression, lemma)) %>%
filter(grepl("^V", pos)) %>%
select(pos, pos_description, lemma, token, sentence_number, sentence) %>%
mutate(sentence = gsub(search_expression, "**\\1", sentence)) %>%
view()
# simplest way -- with just pos
annotated_files %>%
filter(grepl("^V", pos)) %>%
select(pos, pos_description, lemma, token, sentence_number, sentence) %>%
mutate(sentence = gsub(search_expression, "**\\1", sentence)) %>%
view()
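A possible follow-up once the part-of-speech descriptions are joined in is to count how often each verb tag occurs. This is just a sketch, assuming the left_join above has been run.
# sketch: frequency of each verb tag and its description
annotated_files %>%
  filter(grepl("^V", pos)) %>%
  count(pos, pos_description, sort = TRUE)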
Michigan Corpus of Upper-level Student Papers. 2009. Ann Arbor, MI: The Regents of the University of Michigan.
Römer, Ute & Matthew B. O’Donnell. 2011. From student hard drive to web corpus (part 1): The design, compilation and genre classification of the Michigan Corpus of Upper-level Student Papers (MICUSP). Corpora 6(2): 159-177. http://uteroemer.weebly.com/uploads/5/5/7/7/5577406/roemer_and_odonnell_corpora_article_2011.pdf
O’Donnell, Matthew B. & Römer, Ute. 2012. From student hard drive to web corpus (part 2): The annotation and online distribution of the Michigan Corpus of Upper-level Student Papers (MICUSP). Corpora 7(1): 1-18. http://uteroemer.weebly.com/uploads/5/5/7/7/5577406/odonnell_and_roemer_corpora_article_2012.pdf
Please take some time to fill out this workshop’s exit survey.