Reading and checking data

In this tutorial on reading data in R, we will use the data available as supplementary material for Tagliamonte’s (2011) Variationist Sociolinguistics: Change, Observation, Interpretation textbook.

This tutorial uses posit.cloud as its IDE (Integrated Development Environment). You can create a free account to follow along.

Getting started

We’ll be using two packages for this tutorial. RStudio should prompt you to install these once you save your .Rmd file.

library(tidyverse)
library(readxl)

We can now read our data in (remember to download the data from the website, and place it in a data folder in your project).

that_data <- read_excel("data/data_set.xlsx")
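
Before counting anything, it can help to confirm that the file was read the way you expect. The sketch below is only an illustration: it uses the demo workbook that ships with readxl (located with readxl_example()) as a stand-in for data/data_set.xlsx, and the object names example_path and example_data are made up for this example.

```r
library(tidyverse)
library(readxl)

# readxl_example() returns the path to a demo workbook bundled with readxl.
# It stands in for "data/data_set.xlsx" here (an assumption for illustration).
example_path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook; read_excel() reads the first one by default.
excel_sheets(example_path)

example_data <- read_excel(example_path)

# Check the number of rows and columns, and the type assigned to each column.
dim(example_data)
glimpse(example_data)
```

With your own file, replacing example_path with "data/data_set.xlsx" should give the same kind of overview for that_data.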

How many observations per participant?

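Once you have read in your data, check that all the observations you expect are present. We can use count() for that: it returns the number of rows for each participant (the Indiv column), and sort = TRUE lists the participants with the most observations first.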

that_data %>% 
  count(Indiv, sort = TRUE)
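
As a rough sketch of how these counts can be used, the example below builds a small toy data set (the column names Indiv and Dep.var follow the tutorial, but the values and the threshold of 2 are invented) and flags participants with very few observations.

```r
library(tidyverse)

# Toy data: column names follow the tutorial; the values are made up.
toy_data <- tibble(
  Indiv   = c("spk1", "spk1", "spk1", "spk2", "spk2", "spk3"),
  Dep.var = c("THAT", "Zero", "Zero", "THAT", "Zero", "Zero")
)

# Number of observations per participant, largest first.
per_speaker <- toy_data %>%
  count(Indiv, sort = TRUE)

# Participants with fewer than a chosen minimum (2 is an arbitrary threshold).
sparse_speakers <- per_speaker %>%
  filter(n < 2)

# Cross-tabulate participant by variant to spot speakers who never vary.
toy_data %>%
  count(Indiv, Dep.var)
```

The same pipeline should work on that_data unchanged, since it only relies on the Indiv and Dep.var columns.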

Checking values of categorical variables

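When coding data, humans often make mistakes (misspelling category names, for example). We can count how many observations we have for each value of the dependent variable using count() again. Any value that should not exist, or that has a suspiciously low count, is a candidate coding error.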

that_data %>% 
  count(Dep.var)

Let’s do the same with verbs.

that_data %>% 
  count(Verbs.1)

This data is, of course, very clean. To demonstrate a fix, I’ve altered the original data by inserting an error in one of the value names. We will read in the altered data, check the values, and correct the error.

We read the data in.

corrupt_data <- read_excel("data/data_set_corrupt.xlsx")

Then we count the verbs. Notice the value "other", which occurs only once in our data: it is a miscoded variant of "OTHER".

corrupt_data %>% 
  count(Verbs.1)
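
Scanning a table of counts works well for a handful of categories, but stray labels can also be flagged programmatically. The sketch below shows two possible checks on a toy data set; the verb labels and the expected_verbs vector are invented for illustration and are not taken from the actual coding scheme.

```r
library(tidyverse)

# Toy data with one miscoded label; the values are made up for illustration.
toy_verbs <- tibble(
  Verbs.1 = c("Know", "OTHER", "Think", "other", "OTHER", "Say")
)

# Check 1: compare the labels against those the coding scheme allows.
expected_verbs <- c("Know", "Think", "Say", "OTHER")  # assumed coding scheme

unexpected <- toy_verbs %>%
  count(Verbs.1) %>%
  filter(!Verbs.1 %in% expected_verbs)

# Check 2: find labels that become identical once letter case is ignored.
case_clashes <- toy_verbs %>%
  distinct(Verbs.1) %>%
  group_by(label_lower = str_to_lower(Verbs.1)) %>%
  filter(n() > 1) %>%
  ungroup()

unexpected
case_clashes
```

The first check needs a known list of valid labels; the second needs no such list but only catches differences in capitalization.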

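We can certainly fix that in our original data file. But here’s how to fix it in R. case_when() replaces "other" with "OTHER" and, through the TRUE ~ Verbs.1 line, leaves every other value unchanged. The result is stored in a new column, Verbs.1.fixed, so the original column is preserved: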

corrupt_data <- corrupt_data %>% 
  mutate(Verbs.1.fixed = case_when(Verbs.1 == "other" ~ "OTHER",
                                   TRUE ~ Verbs.1))
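
To see what a recode like this actually changes, one option is to compare the old and new columns directly. The sketch below applies the same case_when() pattern to a toy data set (the verb labels are invented) and also shows if_else() as one possible alternative when there is only a single condition.

```r
library(tidyverse)

# Toy data; the values are made up for illustration.
toy_verbs <- tibble(Verbs.1 = c("Know", "OTHER", "other", "Think"))

# Same pattern as in the tutorial: recode one value, keep the rest as-is.
fixed <- toy_verbs %>%
  mutate(Verbs.1.fixed = case_when(Verbs.1 == "other" ~ "OTHER",
                                   TRUE ~ Verbs.1))

# Show only the rows where the recode changed something.
fixed %>%
  filter(Verbs.1 != Verbs.1.fixed)

# With a single condition, if_else() gives the same result.
fixed_alt <- toy_verbs %>%
  mutate(Verbs.1.fixed = if_else(Verbs.1 == "other", "OTHER", Verbs.1))

# Confirm that both approaches agree.
identical(fixed$Verbs.1.fixed, fixed_alt$Verbs.1.fixed)
```

case_when() becomes the more convenient choice once several labels need correcting, because each correction is just one more line.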

Let’s check our verbs again.

corrupt_data %>% 
  count(Verbs.1.fixed)