Make sure you install both R and RStudio for this workshop.
Download and install R from https://cran.r-project.org (if you are a Windows user, first determine whether you are running the 32-bit or the 64-bit version of Windows).
Download and install RStudio from https://rstudio.com/products/rstudio/download/#download
If you already have R and RStudio installed on your computer, make sure your R version is 4.0 or greater by entering sessionInfo() in your console.
sessionInfo()
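If you prefer a programmatic check over reading the sessionInfo() output, base R's getRversion() can be compared directly against the required version:

```r
# TRUE if the running R version meets the workshop requirement
getRversion() >= "4.0"
```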
For this workshop, we will be using two R packages. Make sure you install these packages before proceeding.
install.packages("tidyverse")
install.packages("effects")
Once the packages are installed, you can load them with library().
library(tidyverse)
library(effects)
We will be using the variable *that* data from Tagliamonte’s book *Variationist Sociolinguistics* [@tagliamonte2012variationist], since these data are public and we can check our results against those presented in the book. I highly recommend reading chapter 5 (Quantitative Analysis) for a better understanding of the data.
We have two data files: one that contains the tokens or individual observations of the language structure (i.e., verb complement clauses) and another with the extra-linguistic or social variables for each individual in the data.
# read data in
that_data <- read_csv("data/compl_that_data.csv")
individual_data <- read_csv("data/individual_data.csv")
# inspect data
that_data
dep_var | dep_var_numeric | verb_expanded | verb | matrix_subj | add_elm | sub_subj | int_mat | tense | indiv | context |
---|---|---|---|---|---|---|---|---|---|---|
that_expressed | 0 | Know | Know | Other Pronoun | Something | Other | Some | Past | acork | …we knew THAT a lot of um, of- of er flyers hadn’t |
that_expressed | 0 | Point out | OTHER | Other Pronoun | Something | Other | Some | Past | acork | in the music lesson she was always um, pointing out |
that_omitted | 1 | Know | Know | First Person Singular | Something | Personal Pronoun | Some | Present | acork | I know [Ø] they couldn’t |
that_omitted | 1 | Suppose | OTHER | First Person Singular | Something | Personal Pronoun | Some | Present | acork | …I suppose [Ø] it’s too late now. |
Note that these data frames are tidy, meaning they show the following characteristics: each variable forms a column, each observation forms a row, and each value has its own cell.
Here’s what the participant data frame looks like.
# inspect data
individual_data
indiv | sex | age_3way | age | occ | edu |
---|---|---|---|---|---|
acork | F | Older (66+) | 76 | White-Collar | Up to minimum age |
bhamilton | M | Older (66+) | 91 | Blue-Collar | Up to minimum age |
cbeale | M | Younger (under 35) | 30 | Blue-Collar | Up to minimum age |
cbeckett | M | Younger (under 35) | 29 | White-Collar | More than minimum age |
cgiles | M | Younger (under 35) | 20 | White-Collar | More than minimum age |
conniebeale | F | Middle (36-65) | 48 | Blue-Collar | Up to minimum age |
dburns | M | Middle (36-65) | 60 | White-Collar | More than minimum age |
ddavis | M | Younger (under 35) | 19 | Blue-Collar | Up to minimum age |
dwallis | M | Older (66+) | 72 | Blue-Collar | Up to minimum age |
eburritt | F | Older (66+) | 82 | White-Collar | Up to minimum age |
echapman | F | Middle (36-65) | 55 | White-Collar | More than minimum age |
ggreen | F | Middle (36-65) | 45 | White-Collar | Up to minimum age |
hbutton | M | Older (66+) | 68 | Blue-Collar | Up to minimum age |
hstainton | M | Middle (36-65) | 59 | Blue-Collar | Up to minimum age |
jtweddle | M | Older (66+) | 78 | White-Collar | Up to minimum age |
kyoung | F | Younger (under 35) | 30 | White-Collar | Up to minimum age |
lmcgrath | F | Younger (under 35) | 27 | White-Collar | Up to minimum age |
mjohnstone | M | Middle (36-65) | 53 | White-Collar | More than minimum age |
mlondry | F | Middle (36-65) | 62 | Blue-Collar | Up to minimum age |
mpeters | F | Older (66+) | 81 | White-Collar | Up to minimum age |
mtoovey | M | Middle (36-65) | 40 | White-Collar | More than minimum age |
nbond | F | Younger (under 35) | 34 | White-Collar | More than minimum age |
ocavell | M | Middle (36-65) | 40 | White-Collar | More than minimum age |
rbeale | M | Middle (36-65) | 52 | Blue-Collar | Up to minimum age |
rcotton | F | Younger (under 35) | 34 | White-Collar | Up to minimum age |
rjones | M | Middle (36-65) | 50 | White-Collar | Up to minimum age |
rmitchell | M | Younger (under 35) | 20 | Blue-Collar | Up to minimum age |
sboggin | F | Younger (under 35) | 23 | Blue-Collar | More than minimum age |
sevans | F | Older (66+) | 69 | White-Collar | More than minimum age |
swatkins | M | Middle (36-65) | 37 | White-Collar | More than minimum age |
vpriestly | M | Older (66+) | 84 | Blue-Collar | Up to minimum age |
wevans | M | Older (66+) | 72 | White-Collar | Up to minimum age |
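Because the two data frames share the indiv column, each token can be matched with its speaker's social variables through a join. Here is a minimal sketch using mock versions of the two frames (the real frames have more columns); with the workshop data the equivalent call would be left_join(that_data, individual_data, by = "indiv"):

```r
library(dplyr)

# Mock token data: one row per observation
tokens <- tibble::tibble(
  indiv   = c("acork", "acork", "bhamilton"),
  dep_var = c("that_expressed", "that_omitted", "that_omitted")
)

# Mock participant data: one row per individual
speakers <- tibble::tibble(
  indiv = c("acork", "bhamilton"),
  sex   = c("F", "M")
)

# Every token keeps its row and gains its speaker's social variables
left_join(tokens, speakers, by = "indiv")
```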
We can get an idea of the distribution of age across sex from our tidy participant data.
individual_data %>%
  count(sex, age_3way) %>%
  pivot_wider(names_from = sex,
              values_from = n)
age_3way | F | M |
---|---|---|
Middle (36-65) | 4 | 8 |
Older (66+) | 4 | 6 |
Younger (under 35) | 5 | 5 |
You can do this with any two variables.
individual_data %>%
  count(edu, occ) %>%
  pivot_wider(names_from = edu,
              values_from = n)
occ | More than minimum age | Up to minimum age |
---|---|---|
Blue-Collar | 1 | 11 |
White-Collar | 10 | 10 |
This type of cross count helps you identify correlations between variables. If all observations fell into just two pairings, e.g., Blue-Collar x Up to minimum age and White-Collar x More than minimum age, that would mean that education and occupation measure the exact same thing (they correlate perfectly) and are thus basically the same variable. When variables are highly correlated, you should enter only one of them in your regression.
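If you want a number rather than an eyeball judgment, one option (not covered in the workshop itself) is a chi-squared test of independence on the cross-tabulation. The counts below are copied from the edu-by-occ table above; with the raw data you could instead run chisq.test(table(individual_data$occ, individual_data$edu)):

```r
# Counts from the edu-by-occ table above
cross_tab <- matrix(
  c(1, 11,
    10, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(occ = c("Blue-Collar", "White-Collar"),
                  edu = c("More than minimum age", "Up to minimum age"))
)

# Small cell counts will trigger a warning; with a sample this small,
# treat the p-value as a rough guide only
chisq.test(cross_tab)
```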
Linguistic variables are harder to group than social ones (compared to calculating *that* omission per participant, for example). We can usually cross one linguistic variable at a time with the dependent variable. Let’s check the distribution of the *that* omission rate across lexical verbs in the matrix clause.
that_data %>%
  group_by(verb, dep_var) %>%
  summarise(n = n()) %>%
  mutate(total = sum(n),
         percent = n / total) %>%
  filter(dep_var == "that_omitted") %>%
  ggplot(aes(x = reorder(verb, percent),
             y = percent)) +
  geom_col() +
  geom_label(aes(label = round(percent, digits = 2))) +
  labs(x = "")
It seems *think* is the verb with the highest rate of *that* omission. How do we know whether these differences are actually significant? The easiest way, in my opinion, is to run a logistic regression with the binary variable (*that* omitted vs. *that* expressed) as the dependent variable.
Here’s how to run the logistic regression. We need to use a numeric 0/1 (i.e., binary) variable as the dependent variable. We will enter into the model the factors that Tagliamonte discusses in her book.
model_2 <- that_data %>%
  glm(formula = dep_var_numeric ~ verb + matrix_subj + add_elm + sub_subj + int_mat + tense,
      family = "binomial")
summary(model_2)
##
## Call:
## glm(formula = dep_var_numeric ~ verb + matrix_subj + add_elm +
## sub_subj + int_mat + tense, family = "binomial", data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8343 0.1907 0.2516 0.6431 1.7678
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8351 0.2603 3.208 0.001334 **
## verbOTHER -0.1508 0.2248 -0.671 0.502380
## verbSay 0.9343 0.2637 3.543 0.000395 ***
## verbThink 1.8151 0.2657 6.831 8.43e-12 ***
## matrix_subjNoun Phrase -1.2830 0.1788 -7.175 7.23e-13 ***
## matrix_subjOther Pronoun -1.3420 0.1614 -8.314 < 2e-16 ***
## add_elmSomething -0.6042 0.1434 -4.214 2.51e-05 ***
## sub_subjPersonal Pronoun 0.5615 0.1504 3.735 0.000188 ***
## int_matSome -0.6269 0.1503 -4.170 3.04e-05 ***
## tensePresent 0.7868 0.1418 5.547 2.90e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1929.2 on 1871 degrees of freedom
## Residual deviance: 1415.3 on 1862 degrees of freedom
## AIC: 1435.3
##
## Number of Fisher Scoring iterations: 6
What factor levels are significant? What are these factor levels significant against?
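Each coefficient is tested against the reference level of its factor, which by default is the alphabetically first level (the one absorbed into the intercept). If you want a different baseline, relevel() changes it; here is a sketch on a toy factor (with the workshop data, you would relevel the column in that_data and then refit the model):

```r
# Toy factor: the reference level defaults to the alphabetically first level
verb <- factor(c("Know", "Say", "Think", "OTHER"))
levels(verb)[1]   # "Know"

# Make "Think" the baseline instead
verb <- relevel(verb, ref = "Think")
levels(verb)[1]   # "Think"
```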
Note that the output for the estimates is in log odds, which can be easily converted to probabilities. But let’s visualize the effects instead, using the effect() function, which converts log odds to probabilities for us.
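For reference, the conversion itself is a single call: plogis() is the inverse logit. For example, the intercept estimate from the summary above (0.8351 log odds) corresponds to a probability of roughly 0.70:

```r
# Inverse logit: log odds -> probability
log_odds <- 0.8351   # intercept from the model summary above
plogis(log_odds)     # roughly 0.70

# exp() instead gives the odds: about 2.3 to 1
exp(log_odds)
```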
Here’s the effect of verb on that omission.
effect("verb", model_2) %>%
  data.frame() %>%
  ggplot(aes(x = reorder(verb, fit),
             y = fit)) +
  geom_errorbar(aes(ymin = lower,
                    ymax = upper),
                width = .2) +
  geom_label(aes(label = round(fit, digits = 2))) +
  labs(x = "lexical verb in matrix clause")
How does the plot above compare with the first bar plot on verbs and that omission?
Let’s look at the matrix subject type.
effect("matrix_subj", model_2) %>%
  data.frame() %>%
  ggplot(aes(x = reorder(matrix_subj, fit),
             y = fit)) +
  geom_errorbar(aes(ymin = lower,
                    ymax = upper),
                width = .2) +
  geom_label(aes(label = round(fit, digits = 2))) +
  labs(x = "matrix subject type")
How about the complement clause subject?
effect("sub_subj", model_2) %>%
  data.frame() %>%
  ggplot(aes(x = reorder(sub_subj, fit),
             y = fit)) +
  geom_errorbar(aes(ymin = lower,
                    ymax = upper),
                width = .2) +
  geom_label(aes(label = round(fit, digits = 2))) +
  labs(x = "complement clause subject")
Please take 2-3 minutes to fill out a short survey on what you thought of this workshop.