Reproducing Open Science Research 3
October 06, 10am to 11am. Session Recording
For this tutorial, we will be replicating some of the analysis in Chappell, W. & Kanwit, M. (2021). Do Learners Connect Sociophonetic Variation with Regional and Social Characteristics? The Case of L2 Perception of Spanish Aspiration. Studies in Second Language Acquisition. 44(1). 1–25.
Data can be found at https://www.iris-database.org/details/MQbI5-rz7z3
Overview of study
- Matched-guise test targeting coda /s/ in Spanish
- Question: Do L2 Spanish learners identify native speakers’ social characteristics based on phonetic variants?
Data
Our first step is to load the tidyverse library (RStudio will prompt you to install it).

```r
# load library for data analysis
library(tidyverse)
```
Now we can read the data in. Remember to download the file and place it in a directory called data
in your project.
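If that folder doesn't exist yet, you can create it from R; a small one-time setup step (the folder name just has to match the path used in the next chunk):

```r
# create the data/ folder that the read_csv() call below expects
dir.create("data", showWarnings = FALSE)
```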
```r
l2_span_perception <- read_csv("data/L2PerceptionsofAspirationAnonymized.csv")
```
Participants
Seventy-six language learners
We have multiple observations per participant, so to get the number of participants we need to use distinct().
```r
l2_span_perception %>%
  distinct(Participant) %>%
  nrow()
```
[1] 76
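As an aside, dplyr's n_distinct() gets the same number in one line:

```r
# equivalent shortcut: count unique participant IDs directly
n_distinct(l2_span_perception$Participant)
```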
Table 1 recreation (page 193):
```r
l2_span_perception %>%
  distinct(Participant, Gender) %>%
  count(Gender)
```
| Gender | n  |
|--------|----|
| Female | 55 |
| Male   | 21 |
```r
l2_span_perception %>%
  distinct(Participant, Age) %>%
  summarize(min_age = min(Age),
            max_age = max(Age),
            mean_age = mean(Age),
            median_age = median(Age))
```
| min_age | max_age | mean_age | median_age |
|---------|---------|----------|------------|
| 18      | 57      | 23.64474 | 21         |
```r
l2_span_perception %>%
  distinct(Participant, Education) %>%
  count(Education, sort = TRUE)
```
| Education        | n  |
|------------------|----|
| Some college     | 43 |
| College graduate | 27 |
| High school      | 5  |
| Some high school | 1  |
Research questions
- Are L2-Spanish language learners able to perceive the social meanings of coda /s/ in their L2?
- If so, what social properties will L2 learners associate with coda [h]?
```r
# distribution of perceived age by variant heard
l2_span_perception %>%
  ggplot(aes(x = VariantHeard,
             y = PerceivedAge)) +
  geom_boxplot()
```
```r
# mean perceived age by variant heard
l2_span_perception %>%
  group_by(VariantHeard) %>%
  summarize(mean_age = mean(PerceivedAge)) %>%
  ggplot(aes(x = VariantHeard,
             y = mean_age)) +
  geom_col()
```
Factor Analysis
“[…] we conducted an FA and used the Kaiser Rules to establish which properties should be combined and analyzed as joint factors in the model-construction procedure”
For the factor analysis we need a data frame containing only the indices we will run the analysis on.
```r
indices <- l2_span_perception %>%
  select(Intelligent:Feminine)
```
Now we can run the FA (with 3 factors, since that’s what the authors did).
```r
fa <- factanal(x = indices, factors = 3)
fa
```
Call:
factanal(x = indices, factors = 3)
Uniquenesses:
Intelligent Hardworking Nice Hispanic Confident DowntoEarth
0.351 0.407 0.536 0.487 0.415 0.303
GoodSpanish Feminine
0.469 0.771
Loadings:
Factor1 Factor2 Factor3
Intelligent 0.200 0.780
Hardworking 0.291 0.680 0.213
Nice 0.176 0.427 0.500
Hispanic 0.698 0.127
Confident 0.647 0.406
DowntoEarth 0.107 0.827
GoodSpanish 0.673 0.258 0.109
Feminine -0.443 -0.160
Factor1 Factor2 Factor3
SS loadings 1.723 1.509 1.028
Proportion Var 0.215 0.189 0.129
Cumulative Var 0.215 0.404 0.533
Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 56.19 on 7 degrees of freedom.
The p-value is 8.66e-10
Then we inspect the loadings.
```r
fa$loadings
```
Loadings:
Factor1 Factor2 Factor3
Intelligent 0.200 0.780
Hardworking 0.291 0.680 0.213
Nice 0.176 0.427 0.500
Hispanic 0.698 0.127
Confident 0.647 0.406
DowntoEarth 0.107 0.827
GoodSpanish 0.673 0.258 0.109
Feminine -0.443 -0.160
Factor1 Factor2 Factor3
SS loadings 1.723 1.509 1.028
Proportion Var 0.215 0.189 0.129
Cumulative Var 0.215 0.404 0.533
SS loadings are the sums of squared loadings for each factor (its eigenvalue); dividing by the number of variables gives the proportion of variance explained.
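If you want to see where those numbers come from, here is a quick check by hand (reusing the fa and indices objects created above):

```r
# sum of squared loadings per factor -- matches the "SS loadings" row
ss <- colSums(unclass(fa$loadings)^2)
ss

# divide by the number of variables -- matches the "Proportion Var" row
ss / ncol(indices)
```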
The FA motivated the creation of three combined factors: (a) a status factor (loading for intelligence and work ethic), (b) a confident Spanish-speaker factor (loading for Hispanicity, confidence, and good Spanish), and (c) a solidarity factor (loading for niceness and humility). As no other factors appeared to be correlated, they were explored independently.
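The tutorial only builds the status score below, but the three combined scores could be coded along these lines (dep_confident and dep_solidarity are our own names, and simple sums are one plausible coding, not necessarily the paper's exact procedure):

```r
# one possible coding of the three combined factors (hypothetical names)
l2_span_perception <- l2_span_perception %>%
  mutate(dep_status     = Intelligent + Hardworking,          # status
         dep_confident  = Hispanic + Confident + GoodSpanish, # confident Spanish speaker
         dep_solidarity = Nice + DowntoEarth)                 # solidarity
```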
```r
library(psych)

# scree plot of the eigenvalues (Kaiser rule: retain factors with eigenvalue > 1)
scree(indices)
```
You can learn more about FA in this chapter by Rachael Smyth and Andrew Johnson.
Regression models
Mixed-effects regression models were then created using the lme4 (Bates et al., 2017) and lmerTest (Kuznetsova et al., 2016) packages in R (R Core Team, 2018), and individual models were fitted to the following dependent variables: (a) status (intelligence/work ethic), (b) confident Spanish speaker (Hispanicity/confidence/good Spanish), (c) solidarity (niceness/humility), (d) age, (e) femininity, and (f) perceived speaker origin. Treatment contrasts were used, and the random effects in each model included the listener and the presentation order of the stimuli.
The independent variables tested in each model include variant ([s] or [h]), speaker type (Mexican or Puerto Rican), having taken a phonetics class (yes or no), most advanced Spanish class taken divided into four collapsed categories (elementary, intermediate low, intermediate high, and advanced), number of weeks spent studying abroad (continuous), experience abroad with an aspirating variety (yes or no), whether participants use Spanish regularly with NSs (yes or no), whether participants use Spanish regularly at work (yes or no), whether participants listen regularly to Spanish media (e.g., shows, podcasts, movies, music [yes or no]), listener age (continuous), and listener gender (man, woman, or other).
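The paper notes that treatment contrasts were used; these are R's default for unordered factors, so no extra setup should be needed. You can confirm your session's setting with:

```r
# treatment contrasts ("contr.treatment") are R's default for unordered factors
options("contrasts")
```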
Dependent variable: status (intelligence/work ethic)
```r
l2_span_perception <- l2_span_perception %>%
  mutate(dep_status = Intelligent + Hardworking)
```
```r
library(lme4)
library(lmerTest)

# full model with all the independent variables listed above
model_status <- lmer(dep_status ~ VariantHeard + SpeakerType + PhoneticsClass +
                       MaxClassBinned + WeeksAbroad + AspirationContact +
                       UseWithNSs + UseAtWork + UseListeningToMedia + Age +
                       Gender + (1|Participant) + (1|AudioNumber),
                     data = l2_span_perception)
```
```r
# backward elimination of non-significant effects (lmerTest::step)
step_model_status <- step(model_status)
step_model_status
```
Backward reduced random-effect table:
Eliminated npar logLik AIC LRT Df Pr(>Chisq)
<none> 17 -1273.4 2580.8
(1 | Participant) 0 16 -1378.3 2788.6 209.844 1 < 2.2e-16 ***
(1 | AudioNumber) 0 16 -1284.0 2600.0 21.204 1 4.129e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Backward reduced fixed-effect table:
Degrees of freedom method: Satterthwaite
Eliminated Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Gender 1 0.0623 0.0623 1 64.000 0.0474 0.8283
VariantHeard 2 0.1440 0.1440 1 7.000 0.1096 0.7503
Age 3 0.1393 0.1393 1 65.000 0.1060 0.7458
UseListeningToMedia 4 2.3116 2.3116 1 66.000 1.7593 0.1893
SpeakerType 5 2.9501 2.9501 1 8.000 2.2453 0.1724
UseAtWork 6 2.6552 2.6552 1 67.000 2.0209 0.1598
WeeksAbroad 7 2.9120 2.9120 1 67.999 2.2163 0.1412
UseWithNSs 8 2.2384 2.2384 1 69.000 1.7036 0.1961
PhoneticsClass 9 3.5431 3.5431 1 70.000 2.6967 0.1050
AspirationContact 10 3.4437 3.4437 1 71.000 2.6210 0.1099
MaxClassBinned 11 10.3315 3.4438 3 72.000 2.6211 0.0572 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model found:
dep_status ~ (1 | Participant) + (1 | AudioNumber)
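As an optional aside, lmerTest can hand you that reduced model directly (final_status_model is our own name, not the tutorial's):

```r
# extract the backward-reduced model from the step results
final_status_model <- get_model(step_model_status)
```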
Even though the backward elimination kept only the random effects, we can still fit a model with the predictors of interest and inspect the estimates:

```r
model_status_2 <- lmer(dep_status ~ VariantHeard + PhoneticsClass +
                         VariantHeard:PhoneticsClass +
                         MaxClassBinned + (1|Participant) + (1|AudioNumber),
                       data = l2_span_perception)
summary(model_status_2)
```
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula:
dep_status ~ VariantHeard + PhoneticsClass + VariantHeard:PhoneticsClass +
MaxClassBinned + (1 | Participant) + (1 | AudioNumber)
Data: l2_span_perception
REML criterion at convergence: 2540.6
Scaled residuals:
Min 1Q Median 3Q Max
-4.1648 -0.4831 0.0043 0.5349 3.3504
Random effects:
Groups Name Variance Std.Dev.
Participant (Intercept) 0.94126 0.9702
AudioNumber (Intercept) 0.09798 0.3130
Residual 1.31438 1.1465
Number of obs: 760, groups: Participant, 76; AudioNumber, 10
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.93969 0.26326 48.79329 3.569 0.000815
VariantHeards 0.03584 0.21773 8.45553 0.165 0.873137
PhoneticsClassYes -0.65228 0.36181 87.41817 -1.803 0.074864
MaxClassBinnedElementary -1.11418 0.42631 71.00059 -2.614 0.010932
MaxClassBinnedIntermediate High 0.17895 0.29424 71.00059 0.608 0.545003
MaxClassBinnedIntermediate Low -0.34794 0.32749 71.00059 -1.062 0.291635
VariantHeards:PhoneticsClassYes 0.19766 0.22810 674.00008 0.867 0.386480
(Intercept) ***
VariantHeards
PhoneticsClassYes .
MaxClassBinnedElementary *
MaxClassBinnedIntermediate High
MaxClassBinnedIntermediate Low
VariantHeards:PhoneticsClassYes
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) VrntHr PhntCY MxClBE MxCBIH MxCBIL
VariantHrds -0.414
PhntcsClssY -0.342 0.052
MxClssBnndE -0.425 0.000 0.198
MxClssBnnIH -0.529 0.000 0.046 0.327
MxClssBnnIL -0.553 0.000 0.258 0.341 0.426
VrntHrd:PCY 0.068 -0.165 -0.315 0.000 0.000 0.000
```r
anova(model_status_2)
```
| Term                        | Sum Sq     | Mean Sq   | NumDF | DenDF     | F value   | Pr(>F)    |
|-----------------------------|------------|-----------|-------|-----------|-----------|-----------|
| VariantHeard                | 0.4566423  | 0.4566423 | 1     | 10.24927  | 0.3474209 | 0.5683413 |
| PhoneticsClass              | 3.4147173  | 3.4147173 | 1     | 71.00059  | 2.5979723 | 0.1114373 |
| MaxClassBinned              | 12.9231579 | 4.3077193 | 3     | 71.00059  | 3.2773827 | 0.0258835 |
| VariantHeard:PhoneticsClass | 0.9870502  | 0.9870502 | 1     | 674.00008 | 0.7509638 | 0.3864796 |
```r
library(MuMIn)

# marginal (R2m: fixed effects only) and conditional (R2c: fixed + random) R-squared
r.squaredGLMM(model_status_2)
```
R2m R2c
[1,] 0.06094177 0.4755826
Speakers’ perceived place of origin (Caribbean vs. other)
Table 2 replication
```r
# recode the outcome as 1 = "Caribbean", 0 = anything else,
# and make [s] the reference level for VariantHeard
l2_span_perception <- l2_span_perception %>%
  mutate(origin_dep = if_else(PerceivedSpeakerOrigin == "Caribbean", 1, 0),
         VariantHeard = factor(VariantHeard, levels = c("s", "h")))

model_origin <- lmer(origin_dep ~ VariantHeard + PhoneticsClass +
                       VariantHeard:PhoneticsClass +
                       MaxClassBinned + (1|Participant) + (1|AudioNumber),
                     data = l2_span_perception)
summary(model_origin)
```
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula:
origin_dep ~ VariantHeard + PhoneticsClass + VariantHeard:PhoneticsClass +
MaxClassBinned + (1 | Participant) + (1 | AudioNumber)
Data: l2_span_perception
REML criterion at convergence: 717.9
Scaled residuals:
Min 1Q Median 3Q Max
-2.2038 -0.5766 -0.2064 0.3768 2.7826
Random effects:
Groups Name Variance Std.Dev.
Participant (Intercept) 0.025294 0.15904
AudioNumber (Intercept) 0.004444 0.06666
Residual 0.129384 0.35970
Number of obs: 760, groups: Participant, 76; AudioNumber, 10
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.28275 0.05281 33.89773 5.354 6.02e-06
VariantHeardh 0.13750 0.05085 8.85226 2.704 0.0246
PhoneticsClassYes -0.08467 0.07404 119.68607 -1.144 0.2551
MaxClassBinnedElementary -0.20150 0.08048 70.99993 -2.504 0.0146
MaxClassBinnedIntermediate High -0.07540 0.05555 70.99993 -1.357 0.1790
MaxClassBinnedIntermediate Low -0.29595 0.06183 70.99993 -4.787 8.96e-06
VariantHeardh:PhoneticsClassYes 0.42917 0.07156 674.00004 5.997 3.28e-09
(Intercept) ***
VariantHeardh *
PhoneticsClassYes
MaxClassBinnedElementary *
MaxClassBinnedIntermediate High
MaxClassBinnedIntermediate Low ***
VariantHeardh:PhoneticsClassYes ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) VrntHr PhntCY MxClBE MxCBIH MxCBIL
VariantHrdh -0.481
PhntcsClssY -0.330 0.107
MxClssBnndE -0.400 0.000 0.183
MxClssBnnIH -0.498 0.000 0.043 0.327
MxClssBnnIL -0.520 0.000 0.238 0.341 0.426
VrntHrd:PCY 0.107 -0.222 -0.483 0.000 0.000 0.000
```r
anova(model_origin)
```
| Term                        | Sum Sq    | Mean Sq   | NumDF | DenDF     | F value   | Pr(>F)    |
|-----------------------------|-----------|-----------|-------|-----------|-----------|-----------|
| VariantHeard                | 5.2451299 | 5.2451299 | 1     | 12.36777  | 40.539246 | 0.0000311 |
| PhoneticsClass              | 0.5196316 | 0.5196316 | 1     | 70.99993  | 4.016197  | 0.0488812 |
| MaxClassBinned              | 3.1974421 | 1.0658140 | 3     | 70.99993  | 8.237603  | 0.0000890 |
| VariantHeard:PhoneticsClass | 4.6530702 | 4.6530702 | 1     | 674.00004 | 35.963258 | 0.0000000 |
```r
r.squaredGLMM(model_origin)
```
R2m R2c
[1,] 0.184729 0.3370942
“To clarify this rather complex relationship, a conditional inference tree is provided in Figure 2.”
Oh no
```r
library(effects)
library(ggthemes)

# fitted probabilities for the VariantHeard:PhoneticsClass interaction
effect("VariantHeard:PhoneticsClass", model_origin) %>%
  data.frame() %>%
  ggplot(aes(x = PhoneticsClass,
             y = fit,
             ymin = lower,
             ymax = upper,
             color = VariantHeard)) +
  geom_errorbar() +
  geom_label(aes(label = round(fit, 2))) +
  theme_linedraw() +
  scale_color_colorblind() +
  labs(title = "Probability (fit) of participant saying speaker is Caribbean")
```
effect("MaxClassBinned", model_origin) %>%
data.frame() %>%
mutate(MaxClassBinned = factor(MaxClassBinned,
levels = c("Elementary",
"Intermediate Low",
"Intermediate High",
"Advanced"))) %>%
ggplot(aes(x = MaxClassBinned,
y = fit,
ymin = lower,
ymax = upper)) +
geom_errorbar() +
geom_label(aes(label = round(fit, 2))) +
theme_linedraw() +
labs(title = "Probability (fit) of participant saying speaker is Caribbean")
Conditional Inference Trees
```r
library(partykit)

# ctree() needs factors, so convert the relevant columns first
l2_span_perception <- l2_span_perception %>%
  mutate(VariantHeard = factor(VariantHeard),
         PhoneticsClass = factor(PhoneticsClass),
         MaxClassBinned = factor(MaxClassBinned),
         PerceivedSpeakerOrigin = factor(PerceivedSpeakerOrigin))

ctree_model <- ctree(PerceivedSpeakerOrigin ~ VariantHeard + PhoneticsClass + MaxClassBinned,
                     data = l2_span_perception)
plot(ctree_model)
```
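As a rough sanity check (not part of the original analysis), you can ask how often the tree's predicted origin matches the responses in the data:

```r
# in-sample agreement between the tree's predictions and the observed labels
mean(predict(ctree_model) == l2_span_perception$PerceivedSpeakerOrigin)
```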
Accuracy
```r
# heatmap of perceived origin, as a percentage within each speaker type and variant
l2_span_perception %>%
  group_by(VariantHeard, SpeakerType, PerceivedSpeakerOrigin) %>%
  summarize(count = n()) %>%
  mutate(total = sum(count),
         percent = round((count/total)*100, 2)) %>%
  ggplot(aes(x = PerceivedSpeakerOrigin, y = SpeakerType,
             fill = count)) +
  geom_tile() +
  geom_label(aes(label = paste0(percent, "%")),
             fill = "white") +
  facet_wrap(~VariantHeard) +
  scale_fill_gradient(low = "lightgray", high = "black")
```
```r
# code a response as accurate when the perceived origin matches the speaker's origin
l2_span_perception <- l2_span_perception %>%
  mutate(accuracy = case_when(SpeakerType == "Mexican" &
                                PerceivedSpeakerOrigin == "Mexico" ~ 1,
                              SpeakerType == "Puerto Rican" &
                                PerceivedSpeakerOrigin == "Caribbean" ~ 1,
                              TRUE ~ 0))

# logistic mixed-effects model of accuracy
model_acc <- glmer(accuracy ~ VariantHeard + PhoneticsClass +
                     VariantHeard:PhoneticsClass + SpeakerType +
                     SpeakerType:MaxClassBinned +
                     MaxClassBinned + (1|Participant) + (1|AudioNumber),
                   data = l2_span_perception,
                   family = binomial)
summary(model_acc)
```
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula:
accuracy ~ VariantHeard + PhoneticsClass + VariantHeard:PhoneticsClass +
SpeakerType + SpeakerType:MaxClassBinned + MaxClassBinned +
(1 | Participant) + (1 | AudioNumber)
Data: l2_span_perception
AIC BIC logLik deviance df.resid
902.0 962.2 -438.0 876.0 747
Scaled residuals:
Min 1Q Median 3Q Max
-1.4450 -0.7360 -0.4013 0.9083 5.6471
Random effects:
Groups Name Variance Std.Dev.
Participant (Intercept) 0.1562 0.3953
AudioNumber (Intercept) 0.2489 0.4989
Number of obs: 760, groups: Participant, 76; AudioNumber, 10
Fixed effects:
Estimate Std. Error
(Intercept) -0.6566 0.3850
VariantHeardh 0.3195 0.3679
PhoneticsClassYes -0.0715 0.3472
SpeakerTypePuerto Rican 0.1119 0.4169
MaxClassBinnedElementary 0.4950 0.4546
MaxClassBinnedIntermediate High 0.1233 0.3195
MaxClassBinnedIntermediate Low 0.4346 0.3482
VariantHeardh:PhoneticsClassYes 0.5207 0.4349
SpeakerTypePuerto Rican:MaxClassBinnedElementary -1.8700 0.6045
SpeakerTypePuerto Rican:MaxClassBinnedIntermediate High -0.6291 0.3936
SpeakerTypePuerto Rican:MaxClassBinnedIntermediate Low -2.9296 0.5409
z value Pr(>|z|)
(Intercept) -1.705 0.08813 .
VariantHeardh 0.869 0.38510
PhoneticsClassYes -0.206 0.83683
SpeakerTypePuerto Rican 0.268 0.78836
MaxClassBinnedElementary 1.089 0.27623
MaxClassBinnedIntermediate High 0.386 0.69955
MaxClassBinnedIntermediate Low 1.248 0.21201
VariantHeardh:PhoneticsClassYes 1.197 0.23116
SpeakerTypePuerto Rican:MaxClassBinnedElementary -3.093 0.00198 **
SpeakerTypePuerto Rican:MaxClassBinnedIntermediate High -1.598 0.10996
SpeakerTypePuerto Rican:MaxClassBinnedIntermediate Low -5.416 6.09e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) VrntHr PhntCY SpkTPR MxClBE MxCBIH MxCBIL VH:PCY STPR:M
VariantHrdh -0.476
PhntcsClssY -0.207 0.142
SpkrTypPrtR -0.643 -0.010 0.004
MxClssBnndE -0.299 -0.001 0.118 0.221
MxClssBnnIH -0.392 0.000 0.027 0.312 0.332
MxClssBnnIL -0.390 0.000 0.154 0.288 0.331 0.433
VrntHrd:PCY 0.108 -0.218 -0.660 0.002 -0.004 0.000 -0.006
SpkTPR:MCBE 0.191 -0.019 -0.015 -0.277 -0.645 -0.216 -0.201 0.016
SpTPR:MCBIH 0.282 -0.010 -0.002 -0.426 -0.234 -0.711 -0.306 -0.005 0.298
SpTPR:MCBIL 0.220 -0.030 -0.022 -0.310 -0.174 -0.242 -0.546 0.023 0.227
STPR:H
VariantHrdh
PhntcsClssY
SpkrTypPrtR
MxClssBnndE
MxClssBnnIH
MxClssBnnIL
VrntHrd:PCY
SpkTPR:MCBE
SpTPR:MCBIH
SpTPR:MCBIL 0.336
```r
r.squaredGLMM(model_acc)
```
R2m R2c
theoretical 0.1947105 0.2830057
delta 0.1558159 0.2264736
```r
library(car)

# Type II Wald chi-square tests for the glmer model
Anova(model_acc)
```
| Term                        | Chisq      | Df | Pr(>Chisq) |
|-----------------------------|------------|----|------------|
| VariantHeard                | 1.3398559  | 1  | 0.2470594  |
| PhoneticsClass              | 0.6038264  | 1  | 0.4371218  |
| SpeakerType                 | 3.3399850  | 1  | 0.0676152  |
| MaxClassBinned              | 5.2285785  | 3  | 0.1558045  |
| VariantHeard:PhoneticsClass | 1.4336790  | 1  | 0.2311650  |
| SpeakerType:MaxClassBinned  | 33.5090788 | 3  | 0.0000003  |
effect("SpeakerType:MaxClassBinned", model_acc) %>%
data.frame() %>%
mutate(MaxClassBinned = factor(MaxClassBinned,
levels = c("Elementary",
"Intermediate Low",
"Intermediate High",
"Advanced"))) %>%
ggplot(aes(y = MaxClassBinned,
x = fit,
xmin = lower,
xmax = upper,
color = SpeakerType)) +
geom_errorbar() +
geom_label(aes(label = round(fit, 2))) +
theme_linedraw() +
labs(title = "Probability (fit) of participant being correct") +
scale_color_colorblind() +
facet_wrap(~SpeakerType) +
theme(legend.position = "none") +
scale_y_discrete(limits = rev)