Make sure you install both R and RStudio for this workshop.
Download and install R from https://cran.r-project.org (if you are a Windows user, first determine whether you are running the 32-bit or the 64-bit version of Windows).
Download and install RStudio from https://rstudio.com/products/rstudio/download/#download
If you already have R and RStudio installed on your computer, make sure your R version is 4.0 or greater by entering sessionInfo() in your console.
sessionInfo()
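If you prefer a programmatic check over reading the sessionInfo() output, base R's getRversion() can be compared directly against the required version:

```r
# TRUE if the running R version meets the workshop requirement
getRversion() >= "4.0"
```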
For this workshop, we will be using two R packages. Make sure you install these packages before proceeding.
install.packages("tidyverse")
install.packages("effects")
Once the packages are installed, you can load them with library().
library(tidyverse)
library(effects)
We will be using the variable *that* data from Tagliamonte’s book *Variationist Sociolinguistics* [@tagliamonte2012variationist], since these data are public and we can check our results against those presented in the book. I highly recommend reading chapter 5 (Quantitative Analysis) for a better understanding of the data.
We have two data files: one that contains the tokens or individual observations of the language structure (i.e., verb complement clauses) and another with the extra-linguistic or social variables for each individual in the data.
# read data in
that_data <- read_csv("data/compl_that_data.csv")
individual_data <- read_csv("data/individual_data.csv")
# inspect data
that_data
dep_var | dep_var_numeric | verb_expanded | verb | matrix_subj | add_elm | sub_subj | int_mat | tense | indiv | context |
---|---|---|---|---|---|---|---|---|---|---|
that_expressed | 0 | Know | Know | Other Pronoun | Something | Other | Some | Past | acork | …we knew THAT a lot of um, of- of er flyers hadn’t |
that_expressed | 0 | Point out | OTHER | Other Pronoun | Something | Other | Some | Past | acork | in the music lesson she was always um, pointing out |
that_omitted | 1 | Know | Know | First Person Singular | Something | Personal Pronoun | Some | Present | acork | I know [Ø] they couldn’t |
that_omitted | 1 | Suppose | OTHER | First Person Singular | Something | Personal Pronoun | Some | Present | acork | …I suppose [Ø] it’s too late now. |
Note that these data frames are tidy, meaning they show the following characteristics: each variable forms a column, each observation forms a row, and each value has its own cell.
Here’s what the participant data frame looks like.
# inspect data
individual_data
indiv | sex | age_3way | age | occ | edu |
---|---|---|---|---|---|
acork | F | Older (66+) | 76 | White-Collar | Up to minimum age |
bhamilton | M | Older (66+) | 91 | Blue-Collar | Up to minimum age |
cbeale | M | Younger (under 35) | 30 | Blue-Collar | Up to minimum age |
cbeckett | M | Younger (under 35) | 29 | White-Collar | More than minimum age |
cgiles | M | Younger (under 35) | 20 | White-Collar | More than minimum age |
conniebeale | F | Middle (36-65) | 48 | Blue-Collar | Up to minimum age |
dburns | M | Middle (36-65) | 60 | White-Collar | More than minimum age |
ddavis | M | Younger (under 35) | 19 | Blue-Collar | Up to minimum age |
dwallis | M | Older (66+) | 72 | Blue-Collar | Up to minimum age |
eburritt | F | Older (66+) | 82 | White-Collar | Up to minimum age |
echapman | F | Middle (36-65) | 55 | White-Collar | More than minimum age |
ggreen | F | Middle (36-65) | 45 | White-Collar | Up to minimum age |
hbutton | M | Older (66+) | 68 | Blue-Collar | Up to minimum age |
hstainton | M | Middle (36-65) | 59 | Blue-Collar | Up to minimum age |
jtweddle | M | Older (66+) | 78 | White-Collar | Up to minimum age |
kyoung | F | Younger (under 35) | 30 | White-Collar | Up to minimum age |
lmcgrath | F | Younger (under 35) | 27 | White-Collar | Up to minimum age |
mjohnstone | M | Middle (36-65) | 53 | White-Collar | More than minimum age |
mlondry | F | Middle (36-65) | 62 | Blue-Collar | Up to minimum age |
mpeters | F | Older (66+) | 81 | White-Collar | Up to minimum age |
mtoovey | M | Middle (36-65) | 40 | White-Collar | More than minimum age |
nbond | F | Younger (under 35) | 34 | White-Collar | More than minimum age |
ocavell | M | Middle (36-65) | 40 | White-Collar | More than minimum age |
rbeale | M | Middle (36-65) | 52 | Blue-Collar | Up to minimum age |
rcotton | F | Younger (under 35) | 34 | White-Collar | Up to minimum age |
rjones | M | Middle (36-65) | 50 | White-Collar | Up to minimum age |
rmitchell | M | Younger (under 35) | 20 | Blue-Collar | Up to minimum age |
sboggin | F | Younger (under 35) | 23 | Blue-Collar | More than minimum age |
sevans | F | Older (66+) | 69 | White-Collar | More than minimum age |
swatkins | M | Middle (36-65) | 37 | White-Collar | More than minimum age |
vpriestly | M | Older (66+) | 84 | Blue-Collar | Up to minimum age |
wevans | M | Older (66+) | 72 | White-Collar | Up to minimum age |
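Because the two data frames share the indiv column, each token can be matched with its speaker's social variables through a join. Here is a minimal sketch using mock versions of the two frames (the real frames have more columns); with the workshop data the equivalent call would be left_join(that_data, individual_data, by = "indiv"):

```r
library(dplyr)

# Mock token data: one row per observation
tokens <- tibble::tibble(
  indiv   = c("acork", "acork", "bhamilton"),
  dep_var = c("that_expressed", "that_omitted", "that_omitted")
)

# Mock participant data: one row per individual
speakers <- tibble::tibble(
  indiv = c("acork", "bhamilton"),
  sex   = c("F", "M")
)

# Every token keeps its row and gains its speaker's social variables
left_join(tokens, speakers, by = "indiv")
```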
We can get an idea of the distribution of age across sex from our tidy participant data.
individual_data %>%
  count(sex, age_3way) %>%
  pivot_wider(names_from = sex,
              values_from = n)
age_3way | F | M |
---|---|---|
Middle (36-65) | 4 | 8 |
Older (66+) | 4 | 6 |
Younger (under 35) | 5 | 5 |
You can do this with any two variables.
individual_data %>%
  count(edu, occ) %>%
  pivot_wider(names_from = edu,
              values_from = n)
occ | More than minimum age | Up to minimum age |
---|---|---|
Blue-Collar | 1 | 11 |
White-Collar | 10 | 10 |
This type of cross count helps you identify correlations between variables. If all observations fell into just two pairings, e.g., Blue-Collar x Up to minimum age and White-Collar x More than minimum age, that would mean that education and occupation measure the exact same thing (they correlate perfectly) and are thus basically the same variable. When variables are highly correlated, you should enter only one of them in your regression.
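If you want a number rather than an eyeball judgment, one option (not covered in the workshop itself) is a chi-squared test of independence on the cross-tabulation. The counts below are copied from the edu-by-occ table above; with the raw data you could instead run chisq.test(table(individual_data$occ, individual_data$edu)):

```r
# Counts from the edu-by-occ table above
cross_tab <- matrix(
  c(1, 11,
    10, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(occ = c("Blue-Collar", "White-Collar"),
                  edu = c("More than minimum age", "Up to minimum age"))
)

# Small cell counts will trigger a warning; with a sample this small,
# treat the p-value as a rough guide only
chisq.test(cross_tab)
```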
Linguistic variables are harder to group than social ones (compared to calculating *that* omission per participant, for example). We can usually cross one linguistic variable at a time with the dependent variable. Let’s check the distribution of the *that* omission rate across lexical verbs in the matrix clause.
that_data %>%
  group_by(verb, dep_var) %>%
  summarise(n = n()) %>%
  mutate(total = sum(n),
         percent = n / total) %>%
  filter(dep_var == "that_omitted") %>%
  ggplot(aes(x = reorder(verb, percent),
             y = percent)) +
  geom_col() +
  geom_label(aes(label = round(percent, digits = 2))) +
  labs(x = "")
It seems *think* is the verb with the highest rate of *that* omission. How do we know whether these differences are actually significant? The easiest way, in my opinion, is to run a logistic regression with the binary variable (*that* omitted vs. *that* expressed) as the dependent variable.
Here’s how to run the logistic regression. We need to use a numeric 0/1 (i.e., binary) variable as the dependent variable. We will enter into the model the factors that Tagliamonte discusses in her book.
model_2 <- that_data %>%
  glm(formula = dep_var_numeric ~ verb + matrix_subj + add_elm + sub_subj + int_mat + tense,
      family = "binomial")
summary(model_2)
##
## Call:
## glm(formula = dep_var_numeric ~ verb + matrix_subj + add_elm +
## sub_subj + int_mat + tense, family = "binomial", data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8343 0.1907 0.2516 0.6431 1.7678
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8351 0.2603 3.208 0.001334 **
## verbOTHER -0.1508 0.2248 -0.671 0.502380
## verbSay 0.9343 0.2637 3.543 0.000395 ***
## verbThink 1.8151 0.2657 6.831 8.43e-12 ***
## matrix_subjNoun Phrase -1.2830 0.1788 -7.175 7.23e-13 ***
## matrix_subjOther Pronoun -1.3420 0.1614 -8.314 < 2e-16 ***
## add_elmSomething -0.6042 0.1434 -4.214 2.51e-05 ***
## sub_subjPersonal Pronoun 0.5615 0.1504 3.735 0.000188 ***
## int_matSome -0.6269 0.1503 -4.170 3.04e-05 ***
## tensePresent 0.7868 0.1418 5.547 2.90e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1929.2 on 1871 degrees of freedom
## Residual deviance: 1415.3 on 1862 degrees of freedom
## AIC: 1435.3
##
## Number of Fisher Scoring iterations: 6
What factor levels are significant? What are these factor levels significant against?
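Each coefficient is tested against the reference level of its factor, which by default is the alphabetically first level (the one absorbed into the intercept). If you want a different baseline, relevel() changes it; here is a sketch on a toy factor (with the workshop data, you would relevel the column in that_data and then refit the model):

```r
# Toy factor: the reference level defaults to the alphabetically first level
verb <- factor(c("Know", "Say", "Think", "OTHER"))
levels(verb)[1]   # "Know"

# Make "Think" the baseline instead
verb <- relevel(verb, ref = "Think")
levels(verb)[1]   # "Think"
```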
Note that the output for the estimates is in log odds, which can be easily converted to probabilities. But let’s visualize the effects instead, using the effect() function, which converts log odds to probabilities for us.
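For reference, the conversion itself is a single call: plogis() is the inverse logit. For example, the intercept estimate from the summary above (0.8351 log odds) corresponds to a probability of roughly 0.70:

```r
# Inverse logit: log odds -> probability
log_odds <- 0.8351   # intercept from the model summary above
plogis(log_odds)     # roughly 0.70

# exp() instead gives the odds: about 2.3 to 1
exp(log_odds)
```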
Here’s the effect of verb on that omission.
effect("verb", model_2) %>%
  data.frame() %>%
  ggplot(aes(x = reorder(verb, fit),
             y = fit)) +
  geom_errorbar(aes(ymin = lower,
                    ymax = upper),
                width = .2) +
  geom_label(aes(label = round(fit, digits = 2))) +
  labs(x = "lexical verb in matrix clause")
How does the plot above compare with the first bar plot on verbs and that omission?
Let’s look at the matrix subject type.
effect("matrix_subj", model_2) %>%
  data.frame() %>%
  ggplot(aes(x = reorder(matrix_subj, fit),
             y = fit)) +
  geom_errorbar(aes(ymin = lower,
                    ymax = upper),
                width = .2) +
  geom_label(aes(label = round(fit, digits = 2))) +
  labs(x = "matrix subject type")
How about the complement clause subject?
effect("sub_subj", model_2) %>%
  data.frame() %>%
  ggplot(aes(x = reorder(sub_subj, fit),
             y = fit)) +
  geom_errorbar(aes(ymin = lower,
                    ymax = upper),
                width = .2) +
  geom_label(aes(label = round(fit, digits = 2))) +
  labs(x = "complement clause subject")
Please take 2-3 minutes to fill out a short survey on what you thought of this workshop.