Make sure you install both R and RStudio for this workshop.
Download and install R from https://cran.r-project.org. (If you are a Windows user, first determine whether you are running the 32-bit or the 64-bit version.)
Download and install RStudio from https://rstudio.com/products/rstudio/download/#download
If you already have R and RStudio installed on your computer, make sure your R version is greater than 4.0 by entering sessionInfo() in your console.
sessionInfo()
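If you would rather have R make the comparison for you, base R can compare version numbers directly. The following is a minimal sketch: the names `minimum_version` and `version_ok` are our own, and we assume that version 4.0.0 itself is acceptable.

```r
# Minimum R version for this workshop (assumption: 4.0.0 itself is acceptable)
minimum_version <- "4.0.0"

# getRversion() returns the running R version as an object that can be compared
version_ok <- getRversion() >= minimum_version

if (version_ok) {
  message("R ", as.character(getRversion()), " is recent enough for this workshop.")
} else {
  message("R ", as.character(getRversion()), " is too old; please install a newer version from CRAN.")
}
```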
For this workshop, we will be using two R packages. Make sure you install these packages before proceeding.
install.packages("tidyverse")
install.packages("effects")
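If you are not sure whether you already have these packages, you can check before installing. This is a minimal sketch using base R only; `missing_packages()` is a hypothetical helper written for this example, not part of any package.

```r
# Packages used in this workshop
workshop_packages <- c("tidyverse", "effects")

# Hypothetical helper: return the packages from pkgs that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(workshop_packages)
to_install

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the block only reports what is missing.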
Once the packages are installed, you can load them using library().
library(tidyverse)
library(effects)
We will be using the variable “that” data from Tagliamonte’s book “Variationist Sociolinguistics” [@tagliamonte2012variationist], since these data are public and we can check our results against the results presented in the book. I highly recommend reading Chapter 5 (Quantitative Analysis) for a better understanding of the data.
We have two data files: one that contains the tokens or individual observations of the language structure (i.e., verb complement clauses) and another with the extra-linguistic or social variables for each individual in the data.
# read data in
that_data <- read_csv("data/compl_that_data.csv")
individual_data <- read_csv("data/individual_data.csv")
# inspect data
that_data
dep_var | dep_var_numeric | verb_expanded | verb | matrix_subj | add_elm | sub_subj | int_mat | tense | indiv | context |
---|---|---|---|---|---|---|---|---|---|---|
that_expressed | 0 | Know | Know | Other Pronoun | Something | Other | Some | Past | acork | …we knew THAT a lot of um, of- of er flyers hadn ’t |
that_expressed | 0 | Point out | OTHER | Other Pronoun | Something | Other | Some | Past | acork | in the music lesson she was always um, pointing out |
that_omitted | 1 | Know | Know | First Person Singular | Something | Personal Pronoun | Some | Present | acork | I know [¬Ø] they couldn ’t |
that_omitted | 1 | Suppose | OTHER | First Person Singular | Something | Personal Pronoun | Some | Present | acork | …I suppose [¬Ø] it ’s too late now. |
Note that these data frames are tidy, meaning that each variable has its own column, each observation has its own row, and each value has its own cell.
Here’s what the participant data frame looks like.
# inspect data
individual_data
indiv | sex | age_3way | age | occ | edu |
---|---|---|---|---|---|
acork | F | Older (66+) | 76 | White-Collar | Up to minimum age |
bhamilton | M | Older (66+) | 91 | Blue-Collar | Up to minimum age |
cbeale | M | Younger (under 35) | 30 | Blue-Collar | Up to minimum age |
cbeckett | M | Younger (under 35) | 29 | White-Collar | More than minimum age |
cgiles | M | Younger (under 35) | 20 | White-Collar | More than minimum age |
conniebeale | F | Middle (36-65) | 48 | Blue-Collar | Up to minimum age |
dburns | M | Middle (36-65) | 60 | White-Collar | More than minimum age |
ddavis | M | Younger (under 35) | 19 | Blue-Collar | Up to minimum age |
dwallis | M | Older (66+) | 72 | Blue-Collar | Up to minimum age |
eburritt | F | Older (66+) | 82 | White-Collar | Up to minimum age |
echapman | F | Middle (36-65) | 55 | White-Collar | More than minimum age |
ggreen | F | Middle (36-65) | 45 | White-Collar | Up to minimum age |
hbutton | M | Older (66+) | 68 | Blue-Collar | Up to minimum age |
hstainton | M | Middle (36-65) | 59 | Blue-Collar | Up to minimum age |
jtweddle | M | Older (66+) | 78 | White-Collar | Up to minimum age |
kyoung | F | Younger (under 35) | 30 | White-Collar | Up to minimum age |
lmcgrath | F | Younger (under 35) | 27 | White-Collar | Up to minimum age |
mjohnstone | M | Middle (36-65) | 53 | White-Collar | More than minimum age |
mlondry | F | Middle (36-65) | 62 | Blue-Collar | Up to minimum age |
mpeters | F | Older (66+) | 81 | White-Collar | Up to minimum age |
mtoovey | M | Middle (36-65) | 40 | White-Collar | More than minimum age |
nbond | F | Younger (under 35) | 34 | White-Collar | More than minimum age |
ocavell | M | Middle (36-65) | 40 | White-Collar | More than minimum age |
rbeale | M | Middle (36-65) | 52 | Blue-Collar | Up to minimum age |
rcotton | F | Younger (under 35) | 34 | White-Collar | Up to minimum age |
rjones | M | Middle (36-65) | 50 | White-Collar | Up to minimum age |
rmitchell | M | Younger (under 35) | 20 | Blue-Collar | Up to minimum age |
sboggin | F | Younger (under 35) | 23 | Blue-Collar | More than minimum age |
sevans | F | Older (66+) | 69 | White-Collar | More than minimum age |
swatkins | M | Middle (36-65) | 37 | White-Collar | More than minimum age |
vpriestly | M | Older (66+) | 84 | Blue-Collar | Up to minimum age |
wevans | M | Older (66+) | 72 | White-Collar | Up to minimum age |
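Both data frames share the indiv column, so the social variables can be attached to each token when needed. The following is a minimal sketch with toy stand-ins built from a few values shown in the tables above; the names `tokens`, `speakers`, and `tokens_with_social` are our own, and the workshop itself may combine the two files differently.

```r
library(dplyr)  # also attached by library(tidyverse)

# Toy stand-ins for the two data frames, using a few values shown above
tokens <- tibble(
  dep_var = c("that_expressed", "that_omitted"),
  verb    = c("Know", "OTHER"),
  indiv   = c("acork", "acork")
)
speakers <- tibble(
  indiv = c("acork", "bhamilton"),
  sex   = c("F", "M"),
  age   = c(76, 91)
)

# Keep every token and add the matching speaker's social variables
tokens_with_social <- left_join(tokens, speakers, by = "indiv")
tokens_with_social
```

A left join keeps one row per token, so speakers without tokens (here bhamilton) do not appear in the result.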
We can get an idea of the distribution of age groups across sexes from our tidy participant data.
individual_data %>%
  count(sex, age_3way) %>%
  pivot_wider(names_from = sex, values_from = n)
age_3way | F | M |
---|---|---|
Middle (36-65) | 4 | 8 |
Older (66+) | 4 | 6 |
Younger (under 35) | 5 | 5 |
You can do this with any two variables.
individual_data %>%
  count(edu, occ) %>%
  pivot_wider(names_from = edu, values_from = n)
occ | More than minimum age | Up to minimum age |
---|---|---|
Blue-Collar | 1 | 11 |
White-Collar | 10 | 10 |
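Since the same count-then-pivot pattern works for any pair of columns, it can be wrapped in a small function. This is a sketch only: `cross_count()` is a hypothetical helper, not a tidyverse function, and the toy data reuse the first five speakers from the participant table.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: cross count of two columns, one row per value of row_var
cross_count <- function(data, row_var, col_var) {
  data %>%
    count({{ row_var }}, {{ col_var }}) %>%
    # values_fill = 0 shows combinations that never occur as 0 instead of NA
    pivot_wider(names_from = {{ col_var }}, values_from = n, values_fill = 0)
}

# Toy data: sex and occupation of the first five speakers listed above
toy <- tibble(
  sex = c("F", "M", "M", "M", "M"),
  occ = c("White-Collar", "Blue-Collar", "Blue-Collar", "White-Collar", "White-Collar")
)
toy_counts <- cross_count(toy, occ, sex)
toy_counts
```

With the workshop data loaded, the same call would be `cross_count(individual_data, occ, edu)`.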
This type of cross count helps you identify any correlation between variables. If all observations fell into just two combinations, e.g., Blue-Collar x Up to minimum age and White-Collar x More than minimum age, education and occupation would be measuring exactly the same thing (they would correlate perfectly) and would thus basically be the same variable. When two variables are highly correlated, choose just one of them to enter in your regression.
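To see what the perfectly correlated case would look like in a cross count, here is a minimal sketch with made-up data. The `overlap` data frame is hypothetical and does not come from the workshop files; it only illustrates the pattern to watch for.

```r
library(dplyr)
library(tidyr)

# Hypothetical data (NOT from the workshop files):
# occupation and education overlap completely
overlap <- tibble(
  occ = c("Blue-Collar", "Blue-Collar",
          "White-Collar", "White-Collar", "White-Collar"),
  edu = c("Up to minimum age", "Up to minimum age",
          "More than minimum age", "More than minimum age", "More than minimum age")
)

# Same cross count as before; absent combinations are filled with 0
overlap_counts <- overlap %>%
  count(occ, edu) %>%
  pivot_wider(names_from = edu, values_from = n, values_fill = 0)
overlap_counts
```

Each row of the result has a single non-zero cell, which is the signal that one variable fully determines the other.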