1 Workshop prep

1.1 Install R and RStudio

Make sure you install both R and RStudio for this workshop.

  1. Download and install R from https://cran.r-project.org (if you are a Windows user, first determine whether you are running the 32-bit or the 64-bit version of Windows)

  2. Download and install RStudio from https://rstudio.com/products/rstudio/download/#download

If you already have R and RStudio installed on your computer, make sure your R version is greater than 4.0 by entering sessionInfo() in your console.

sessionInfo()
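If you would rather get a direct yes/no answer than read the full sessionInfo() printout, the check can be sketched with base R's getRversion(). This is only a minimal sketch: it assumes the requirement means 4.0.0 or later, and r_is_recent is a hypothetical name.

```r
# Compare the running R version against the assumed minimum (4.0.0).
# getRversion() returns a version object that can be compared with a string.
r_is_recent <- getRversion() >= "4.0.0"

if (r_is_recent) {
  message("R ", as.character(getRversion()), " is recent enough.")
} else {
  message("R ", as.character(getRversion()), " is too old; please update R.")
}
```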

1.2 Install and load libraries

For this workshop, we will be using two R packages. Make sure you install these packages before proceeding.

install.packages("tidyverse")
install.packages("effects")

Once the packages are installed, you can load them using library().

library(tidyverse)
library(effects)
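If you are not sure whether the packages are already installed, a small helper can check before installing anything. The sketch below uses base R's requireNamespace(); missing_packages is a hypothetical helper name, not part of any package, and note that requireNamespace() loads a package's namespace as a side effect when the package is found.

```r
# Return the subset of `pkgs` that cannot be found in the current library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "effects"))
to_install

# Install only what is missing (uncomment to run; needs an internet connection).
# if (length(to_install) > 0) install.packages(to_install)
```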

2 Read and inspect data

We will be using the data on variable (that), i.e., the presence or omission of the complementizer that, from Tagliamonte’s book “Variationist Sociolinguistics” [@tagliamonte2012variationist], since these data are public and we can check our results against those presented in the book. I highly recommend reading Chapter 5 (Quantitative Analysis) for a better understanding of the data.

We have two data files: one that contains the tokens or individual observations of the language structure (i.e., verb complement clauses) and another with the extra-linguistic or social variables for each individual in the data.

# read data in
that_data <- read_csv("data/compl_that_data.csv")
individual_data <- read_csv("data/individual_data.csv")
# inspect data
that_data
dep_var         dep_var_numeric  verb_expanded  verb   matrix_subj            add_elm    sub_subj          int_mat  tense    indiv  context
that_expressed  0                Know           Know   Other Pronoun          Something  Other             Some     Past     acork  …we knew THAT a lot of um, of- of er flyers hadn ’t
that_expressed  0                Point out      OTHER  Other Pronoun          Something  Other             Some     Past     acork  in the music lesson she was always um, pointing out
that_omitted    1                Know           Know   First Person Singular  Something  Personal Pronoun  Some     Present  acork  I know [Ø] they couldn ’t
that_omitted    1                Suppose        OTHER  First Person Singular  Something  Personal Pronoun  Some     Present  acork  …I suppose [Ø] it ’s too late now.

Note that these data frames are tidy: each variable has its own column, each observation has its own row, and each value has its own cell.

Here’s what the participant data frame looks like.

# inspect data
individual_data
indiv sex age_3way age occ edu
acork F Older (66+) 76 White-Collar Up to minimum age
bhamilton M Older (66+) 91 Blue-Collar Up to minimum age
cbeale M Younger (under 35) 30 Blue-Collar Up to minimum age
cbeckett M Younger (under 35) 29 White-Collar More than minimum age
cgiles M Younger (under 35) 20 White-Collar More than minimum age
conniebeale F Middle (36-65) 48 Blue-Collar Up to minimum age
dburns M Middle (36-65) 60 White-Collar More than minimum age
ddavis M Younger (under 35) 19 Blue-Collar Up to minimum age
dwallis M Older (66+) 72 Blue-Collar Up to minimum age
eburritt F Older (66+) 82 White-Collar Up to minimum age
echapman F Middle (36-65) 55 White-Collar More than minimum age
ggreen F Middle (36-65) 45 White-Collar Up to minimum age
hbutton M Older (66+) 68 Blue-Collar Up to minimum age
hstainton M Middle (36-65) 59 Blue-Collar Up to minimum age
jtweddle M Older (66+) 78 White-Collar Up to minimum age
kyoung F Younger (under 35) 30 White-Collar Up to minimum age
lmcgrath F Younger (under 35) 27 White-Collar Up to minimum age
mjohnstone M Middle (36-65) 53 White-Collar More than minimum age
mlondry F Middle (36-65) 62 Blue-Collar Up to minimum age
mpeters F Older (66+) 81 White-Collar Up to minimum age
mtoovey M Middle (36-65) 40 White-Collar More than minimum age
nbond F Younger (under 35) 34 White-Collar More than minimum age
ocavell M Middle (36-65) 40 White-Collar More than minimum age
rbeale M Middle (36-65) 52 Blue-Collar Up to minimum age
rcotton F Younger (under 35) 34 White-Collar Up to minimum age
rjones M Middle (36-65) 50 White-Collar Up to minimum age
rmitchell M Younger (under 35) 20 Blue-Collar Up to minimum age
sboggin F Younger (under 35) 23 Blue-Collar More than minimum age
sevans F Older (66+) 69 White-Collar More than minimum age
swatkins M Middle (36-65) 37 White-Collar More than minimum age
vpriestly M Older (66+) 84 Blue-Collar Up to minimum age
wevans M Older (66+) 72 White-Collar Up to minimum age
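Because the token data and the participant data are linked through the indiv column, it can be worth checking that the two files agree before combining them. The following is a minimal sketch on made-up toy data (tokens, speakers, and the ID zzz_unknown are all hypothetical); the same calls could be applied to that_data and individual_data.

```r
library(dplyr)

# Toy stand-ins for the two workshop data frames (values are made up).
tokens <- tibble(
  indiv   = c("acork", "acork", "cbeale", "zzz_unknown"),
  dep_var = c("that_omitted", "that_expressed", "that_omitted", "that_omitted")
)
speakers <- tibble(
  indiv = c("acork", "cbeale"),
  sex   = c("F", "M")
)

# Tokens whose speaker does not appear in the speaker table.
unmatched <- anti_join(tokens, speakers, by = "indiv")
unmatched

# Speaker IDs should be unique; otherwise a join would duplicate tokens.
duplicated_ids <- speakers %>%
  count(indiv) %>%
  filter(n > 1)
duplicated_ids
```

An empty result for both checks suggests that every token can be matched to exactly one participant.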

We can get an idea of the distribution of age across sex from our tidy participant data.

individual_data %>%
  count(sex, age_3way) %>%
  pivot_wider(names_from = sex,
              values_from = n)
age_3way            F  M
Middle (36-65)      4  8
Older (66+)         4  6
Younger (under 35)  5  5

You can do this with any two variables.

individual_data %>%
  count(edu, occ) %>%
  pivot_wider(names_from = edu,
              values_from = n)
occ           More than minimum age  Up to minimum age
Blue-Collar   1                      11
White-Collar  10                     10

This type of cross count helps you identify any correlation between variables. If all observations fell into just two cells, e.g., Blue-Collar x Up to minimum age and White-Collar x More than minimum age, education and occupation would be measuring exactly the same thing (they would correlate perfectly) and would thus be basically the same variable. When two variables are highly correlated, choose just one of them to enter in your regression.
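To make the perfectly correlated case concrete, here is a minimal sketch on made-up data (the confounded data frame is hypothetical and is not the workshop data). It uses the same count() and pivot_wider() calls as above, with values_fill = 0 so that empty cells print as 0 rather than NA.

```r
library(dplyr)
library(tidyr)

# Hypothetical speakers in which occupation and education overlap completely.
confounded <- tibble(
  occ = c("Blue-Collar", "Blue-Collar", "Blue-Collar",
          "White-Collar", "White-Collar"),
  edu = c("Up to minimum age", "Up to minimum age", "Up to minimum age",
          "More than minimum age", "More than minimum age")
)

cross_count <- confounded %>%
  count(edu, occ) %>%
  pivot_wider(names_from = edu,
              values_from = n,
              values_fill = 0)   # show empty cells as 0 instead of NA
cross_count
```

In this toy table every speaker sits in one of two cells and the other two cells are 0. The workshop data are not that extreme, but only one Blue-Collar speaker has More than minimum age, so that cell is close to empty.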

3 Data analysis by social variables

We will start with the easiest grouping of the data: the rate of that omission by participant.

3.1 Calculating percentages of that omission per participant

We first create an object that holds the rate of omission for each individual. Note that many people will have this as their “raw” data – this is not ideal, since you can always calculate counts and percentages from tidy data (i.e., data that contains one observation per row and one variable per column) but the inverse is not true.

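percent_omit_indiv <- that_data %>%
  # count the tokens of each variant for each individual
  group_by(indiv, dep_var) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  # still grouped by indiv, so sum(n) is each individual's total
  mutate(total = sum(n),
         percent = n/total) %>%
  filter(dep_var == "that_omitted") %>%
  select(-dep_var) %>%
  # add the social variables for each individual
  left_join(individual_data, by = "indiv")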
# inspect data
percent_omit_indiv
indiv n total percent sex age_3way age occ edu
acork 21 32 0.6562500 F Older (66+) 76 White-Collar Up to minimum age
bhamilton 26 28 0.9285714 M Older (66+) 91 Blue-Collar Up to minimum age
cbeale 13 19 0.6842105 M Younger (under 35) 30 Blue-Collar Up to minimum age
cbeckett 51 62 0.8225806 M Younger (under 35) 29 White-Collar More than minimum age
cgiles 48 52 0.9230769 M Younger (under 35) 20 White-Collar More than minimum age
conniebeale 57 66 0.8636364 F Middle (36-65) 48 Blue-Collar Up to minimum age
dburns 125 148 0.8445946 M Middle (36-65) 60 White-Collar More than minimum age
ddavis 54 57 0.9473684 M Younger (under 35) 19 Blue-Collar Up to minimum age
dwallis 17 22 0.7727273 M Older (66+) 72 Blue-Collar Up to minimum age
eburritt 38 44 0.8636364 F Older (66+) 82 White-Collar Up to minimum age
echapman 105 134 0.7835821 F Middle (36-65) 55 White-Collar More than minimum age
ggreen 75 86 0.8720930 F Middle (36-65) 45 White-Collar Up to minimum age
hbutton 32 44 0.7272727 M Older (66+) 68 Blue-Collar Up to minimum age
hstainton 38 41 0.9268293 M Middle (36-65) 59 Blue-Collar Up to minimum age
jtweddle 12 15 0.8000000 M Older (66+) 78 White-Collar Up to minimum age
kyoung 70 100 0.7000000 F Younger (under 35) 30 White-Collar Up to minimum age
lmcgrath 39 44 0.8863636 F Younger (under 35) 27 White-Collar Up to minimum age
mjohnstone 8 15 0.5333333 M Middle (36-65) 53 White-Collar More than minimum age
mlondry 38 48 0.7916667 F Middle (36-65) 62 Blue-Collar Up to minimum age
mpeters 49 55 0.8909091 F Older (66+) 81 White-Collar Up to minimum age
mtoovey 47 60 0.7833333 M Middle (36-65) 40 White-Collar More than minimum age
nbond 128 167 0.7664671 F Younger (under 35) 34 White-Collar More than minimum age
ocavell 23 35 0.6571429 M Middle (36-65) 40 White-Collar More than minimum age
rbeale 4 5 0.8000000 M Middle (36-65) 52 Blue-Collar Up to minimum age
rcotton 100 123 0.8130081 F Younger (under 35) 34 White-Collar Up to minimum age
rjones 43 67 0.6417910 M Middle (36-65) 50 White-Collar Up to minimum age
rmitchell 49 51 0.9607843 M Younger (under 35) 20 Blue-Collar Up to minimum age
sboggin 24 24 1.0000000 F Younger (under 35) 23 Blue-Collar More than minimum age
sevans 56 107 0.5233645 F Older (66+) 69 White-Collar More than minimum age
swatkins 23 32 0.7187500 M Middle (36-65) 37 White-Collar More than minimum age
vpriestly 15 15 1.0000000 M Older (66+) 84 Blue-Collar Up to minimum age
wevans 49 74 0.6621622 M Older (66+) 72 White-Collar Up to minimum age
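Since dep_var_numeric is 1 for that_omitted and 0 for that_expressed, the same rate can also be obtained as the mean of that column. The sketch below shows this on made-up tokens (toy_tokens, s1, and s2 are hypothetical); it is an alternative route to the same numbers, not the method used above.

```r
library(dplyr)

# Toy tokens (made up); dep_var_numeric is 1 when "that" is omitted.
toy_tokens <- tibble(
  indiv = c("s1", "s1", "s1", "s1", "s2", "s2"),
  dep_var_numeric = c(1, 1, 1, 0, 0, 0)
)

toy_rates <- toy_tokens %>%
  group_by(indiv) %>%
  summarise(n_omitted = sum(dep_var_numeric),   # omitted tokens
            total = n(),                        # all tokens
            percent = mean(dep_var_numeric))    # proportion omitted
toy_rates
```

One difference worth noting: a speaker who never omits that (s2 here) is kept with a rate of 0, whereas filtering on dep_var == "that_omitted" would drop such a speaker because no row exists for that variant.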

We are now ready to visualize our results (which are just descriptive).

3.2 Visualizing that omission across extra-linguistic groups

Let’s start with a bar plot of the rate of that omission across occupation.

percent_omit_indiv %>%
  group_by(occ) %>%
  summarise(mean_omission = mean(percent)) %>%
  ggplot(aes(x = occ,
             y = mean_omission)) +
  geom_col() +
  geom_label(aes(label = round(mean_omission, 
                               digits = 2)))
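Note that mean(percent) averages the per-speaker rates, so every speaker counts equally regardless of how many tokens they contributed. The sketch below, on made-up counts (toy, s1, and s2 are hypothetical), contrasts this with a pooled rate in which every token counts equally.

```r
library(dplyr)

# Made-up per-speaker counts: one speaker has far more tokens than the other.
toy <- tibble(
  indiv = c("s1", "s2"),
  occ   = c("Blue-Collar", "Blue-Collar"),
  n     = c(90, 1),     # tokens with "that" omitted
  total = c(100, 10)    # all tokens
) %>%
  mutate(percent = n / total)

toy_summary <- toy %>%
  group_by(occ) %>%
  summarise(mean_of_rates = mean(percent),        # each speaker counts equally
            pooled_rate   = sum(n) / sum(total))  # each token counts equally
toy_summary
```

The two summaries can differ noticeably when speakers contribute very different numbers of tokens, as they do in the workshop data (from 5 to 167 tokens per speaker).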

We can add another categorical variable, and draw a line plot. Let’s visualize the rate of that omission across sex and occupation.

percent_omit_indiv %>%
  group_by(sex, occ) %>%
  summarise(mean_omission = mean(percent)) %>%
  ggplot(aes(x = occ,
             y = mean_omission,
             color = sex,
             group = sex)) +
  geom_line() +
  geom_point() 

As we can see in the resulting plot (above), the lines are parallel, meaning the effect of sex across occupation levels seems to be the same (no interaction between sex and occupation).

Let’s visualize the rate of that omission across sex and education.

percent_omit_indiv %>%
  group_by(sex, edu) %>%
  summarise(mean_omission = mean(percent)) %>%
  mutate(edu = factor(edu,
                      levels = c("Up to minimum age",
                                 "More than minimum age"))) %>%
  ggplot(aes(x = edu,
             y = mean_omission,
             color = sex,
             group = sex)) +
  geom_line() +
  geom_point() 

The lines in the plot above are not parallel (they cross), which means that the effect of sex across education is not the same (there seems to be an interaction between sex and education).
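The parallel-versus-crossing reading of these plots can also be expressed numerically: if the lines are parallel, the gap between the sexes is the same at both levels of the other variable. Here is a minimal sketch using made-up group means (toy_means and its values are hypothetical and are not the workshop results).

```r
library(dplyr)
library(tidyr)

# Made-up group means, chosen so that the two lines cross.
toy_means <- tibble(
  sex = c("F", "F", "M", "M"),
  edu = c("Up to minimum age", "More than minimum age",
          "Up to minimum age", "More than minimum age"),
  mean_omission = c(0.85, 0.75, 0.80, 0.82)
)

# One row per education level, with the F minus M gap at that level.
sex_gap <- toy_means %>%
  pivot_wider(names_from = sex,
              values_from = mean_omission,
              names_prefix = "rate_") %>%
  mutate(gap_F_minus_M = rate_F - rate_M)
sex_gap

# Change in the gap between the two levels; 0 would mean parallel lines.
gap_change <- diff(sex_gap$gap_F_minus_M)
gap_change
```

A gap that changes sign, as in this toy example, corresponds to crossing lines. This remains a descriptive check; whether an interaction is reliable is a question for the regression model.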

Let’s visualize the rate of that omission across sex and age group.

percent_omit_indiv %>%
  group_by(sex, age_3way) %>%
  summarise(mean_omission = mean(percent)) %>%
  mutate(age_3way = factor(age_3way,
                           levels = c("Younger (under 35)",
                                      "Middle (36-65)",
                                      "Older (66+)"))) %>%
  ggplot(aes(x = age_3way,
             y = mean_omission,
             color = sex,
             group = sex)) +
  geom_line() +
  geom_point()