Reading and checking data
For this tutorial on how to read data into R, we will use the data available as complementary material for Tagliamonte’s (2011) textbook Variationist Sociolinguistics: Change, Observation, Interpretation.
We will use posit.cloud as our IDE (Integrated Development Environment). You can create a free account to follow along.
Video demonstrations
Getting started
We’ll be using two packages for this tutorial. RStudio should prompt you to install them once you save your .Rmd file.
library(tidyverse)
library(readxl)
We can now read our data in (remember to download the data from the website and place it in a data folder in your project).
that_data <- read_excel("data/data_set.xlsx")
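Before counting anything, it can help to look at the structure of what was just read in. The following is only a sketch: it uses the example workbook that ships with readxl (not the tutorial's data), and the names example_path and example_data are made up for illustration.

```r
library(readxl)
library(dplyr)

# Path to an example workbook bundled with readxl (not the tutorial's file)
example_path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook; read_excel() reads the first one by default
excel_sheets(example_path)

# Read the first sheet into a tibble
example_data <- read_excel(example_path)

# Show column names, column types, and the first few values of each column
glimpse(example_data)
```

Running glimpse() on your own data right after read_excel() is a quick way to confirm that the column names and types are what you expect.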
How many observations per participant?
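Once you have read in your data, you should check that all the observations you expect are there. We can use count() for that. Here we count the number of rows per participant (Indiv) and sort the result from the most to the fewest observations.
that_data %>%
  count(Indiv, sort = TRUE)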
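To make clear what count() is doing, here is a small sketch on made-up data. The tibble toy_data and its values are hypothetical and only borrow the column names from the tutorial. The second pipeline spells out the same computation with group_by(), summarise(), and arrange().

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: one row per token, with a participant ID
toy_data <- tibble(
  Indiv   = c("a", "a", "b", "c", "c", "c"),
  Dep.var = c("that", "zero", "zero", "that", "zero", "zero")
)

# Rows per participant, largest group first
toy_counts <- toy_data %>%
  count(Indiv, sort = TRUE)

# The same result written out step by step
toy_counts_long <- toy_data %>%
  group_by(Indiv) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

toy_counts
```

A participant with far fewer rows than the others (or one who is missing altogether) is usually worth checking against the original file.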
Checking values of categorical variables
When coding data, humans often make mistakes (misspelling category names, for example). We can check how many observations we have for each value of the dependent variable using count() again.
that_data %>%
  count(Dep.var)
Let’s do the same with verbs.
that_data %>%
  count(Verbs.1)
This data is, of course, very clean. To show what a coding error looks like, I’ve changed the original data by inserting an error in one of the value names. We will read in that data, check the values, and fix the error.
We read the data in.
corrupt_data <- read_excel("data/data_set_corrupt.xlsx")
Then we count the verbs. Notice the lowercase value other, which occurs only once in our data.
corrupt_data %>%
  count(Verbs.1)
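Spotting an odd value by eye works when there are only a few categories. With more categories, one option is to compare the values in the data against the set you expect. The sketch below uses made-up data: toy_corrupt, expected_verbs, and the verb labels in them are hypothetical, not the actual values in the textbook data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one miscoded value ("other" instead of "OTHER")
toy_corrupt <- tibble(
  Verbs.1 = c("OTHER", "Think", "other", "Say", "OTHER", "Think")
)

# Hypothetical list of the category labels we expect to see
expected_verbs <- c("OTHER", "Say", "Think")

# Count each value, then keep only the values that are not in the expected set
unexpected_values <- toy_corrupt %>%
  count(Verbs.1) %>%
  filter(!Verbs.1 %in% expected_verbs)

unexpected_values
```

If the resulting tibble has zero rows, every value in the column matched one of the expected labels.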
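We could fix that in the original data file, but here’s how to fix it in R. We create a new column, Verbs.1.fixed, in which other is replaced with OTHER and every other value is kept as it is:
corrupt_data <- corrupt_data %>%
  mutate(Verbs.1.fixed = case_when(Verbs.1 == "other" ~ "OTHER",
                                   TRUE ~ Verbs.1))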
Let’s check our verbs again.
corrupt_data %>%
  count(Verbs.1.fixed)
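The same recoding can be tried out on a small made-up example. In the sketch below, toy_corrupt and its values are hypothetical. It also shows if_else() as an alternative way to write a single replacement; that alternative is my suggestion and is not part of the tutorial's own code.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one miscoded value
toy_corrupt <- tibble(
  Verbs.1 = c("OTHER", "Think", "other", "Say", "OTHER")
)

toy_fixed <- toy_corrupt %>%
  mutate(
    # Same pattern as the tutorial: replace "other", keep everything else
    Verbs.1.fixed = case_when(Verbs.1 == "other" ~ "OTHER",
                              TRUE ~ Verbs.1),
    # Alternative for a single replacement, using if_else()
    Verbs.1.alt = if_else(Verbs.1 == "other", "OTHER", Verbs.1)
  )

# Count the fixed column to confirm the lowercase value is gone
fixed_counts <- toy_fixed %>%
  count(Verbs.1.fixed)

fixed_counts
```

Writing the fix to a new column, as the tutorial does, keeps the original values available for comparison; case_when() becomes more convenient than if_else() once there are several values to recode.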