---
title: "Spanish CDI explorations"
author: "Paulina & Mike"
date: "2022-10-13"
output: html_document
---
# Intro
The goal of this project is to use the five Spanish CDI datasets in Wordbank to investigate dialect variation in Spanish language acquisition.
Here are some potential questions:
* Do the sumscores across the intersecting items look the same at the population level (correcting for demographics)?
* Do individual common items have similar psychometric / developmental properties?
* What are the properties of items with the same unilemma but different item definitions?
* What are the properties of items that are not shared across dialects? (Could also compare Spanish (Mexican) between monolingual and bilingual populations)
```{r setup}
library(tidyverse)
library(wordbankr)
library(arm)
```
# Data loading
Start with summary scores.
```{r}
eu_ws <- get_administration_data(language = "Spanish (European)",
                                 form = "WS",
                                 include_demographic_info = TRUE)
mx_ws <- get_administration_data(language = "Spanish (Mexican)",
                                 form = "WS",
                                 include_demographic_info = TRUE)
pr_ws <- get_administration_data(language = "Spanish (Peruvian)",
                                 form = "WS",
                                 include_demographic_info = TRUE)
ar_ws <- get_administration_data(language = "Spanish (Argentinian)",
                                 form = "WS",
                                 include_demographic_info = TRUE)

sp_ws <- bind_rows(eu_ws,
                   mx_ws,
                   pr_ws,
                   ar_ws)
```
Make a plot!
```{r}
ggplot(sp_ws, aes(x = age, y = production)) +
  geom_jitter(width = .2, alpha = .2) +
  geom_smooth() +
  facet_wrap(~language)
```
# Comparison on the intersection of items
QUESTION: Do the sumscores across the intersecting items look the same at the population level (correcting for demographics)?
```{r}
langs <- c("Spanish (European)", "Spanish (Mexican)",
           "Spanish (Peruvian)", "Spanish (Argentinian)")

d_ws <- map_df(langs, function(x) get_instrument_data(language = x,
                                                      form = "WS",
                                                      administration_info = TRUE,
                                                      item_info = TRUE))
```
Find the overlapping unilemmas.
For now, pull those unilemmas that are:
1) in all languages,
2) only once in each language.
```{r}
items <- map_df(langs, function(x) get_item_data(language = x, form = "WS"))

intersection <- items |>
  group_by(uni_lemma) |>
  summarise(n_langs = n_distinct(language),
            n = n()) |>
  filter(n_langs == 4, n == 4) |>
  pull(uni_lemma)
# 224 common items
```
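The two filter conditions can be checked on a small made-up example. The sketch below uses base R and a hypothetical three-language item table (`toy_items`, with invented languages "A", "B", "C"); it is only meant to show which uni_lemmas survive the filter, not to reproduce the Wordbank result.

```r
# Toy item table (hypothetical data, not from Wordbank):
# one row per item per language.
toy_items <- data.frame(
  language  = c("A", "B", "C", "A", "B", "A", "A", "B", "C"),
  uni_lemma = c("dog", "dog", "dog", "cat", "cat",
                "ball", "ball", "ball", "ball"),
  stringsAsFactors = FALSE
)
n_toy_langs <- length(unique(toy_items$language))

# Number of distinct languages and number of rows for each uni_lemma.
n_langs <- tapply(toy_items$language, toy_items$uni_lemma,
                  function(x) length(unique(x)))
n_rows <- tapply(toy_items$language, toy_items$uni_lemma, length)

# Keep uni_lemmas that appear in every language and exactly once in each.
toy_intersection <- names(n_langs)[n_langs == n_toy_langs &
                                     n_rows == n_toy_langs]
toy_intersection
```

Here "cat" fails the first condition (it is missing from language "C") and "ball" fails the second (it appears twice in language "A"), so only "dog" remains.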
Filter data and replot.
```{r}
ms_ws <- d_ws |>
  filter(uni_lemma %in% intersection) |>
  group_by(child_id, language, age) |>
  summarise(produces = sum(produces))

ggplot(ms_ws, aes(x = age, y = produces)) +
  geom_jitter(width = .2, alpha = .2) +
  geom_smooth() +
  facet_wrap(~language)
```
# Comparison on the intersection of items, controlling for demographics
```{r}
summary(sp_ws)
summary(ar_ws)
summary(pr_ws)
summary(eu_ws)
summary(mx_ws)

# Hoff and Marchman has no data for production
# 536 missing data points
# Identify the rows with missing data
which(is.na(mx_ws$production))
# Drop them with a logical index; `-which()` would drop every row if nothing were missing
mx_ws <- mx_ws[!is.na(mx_ws$production), ]

# New binding
sp_ws <- bind_rows(eu_ws,
                   mx_ws,
                   pr_ws,
                   ar_ws)

# Drop rows without caregiver_education (19X missing data points)
sp_ws <- sp_ws[!is.na(sp_ws$caregiver_education), ]

# With the new data, control for demographics
# na.exclude keeps residuals() aligned with the rows of sp_ws when a predictor is NA
sp_ws_democontrol <- lm(production ~ caregiver_education + sex,
                        data = sp_ws,
                        na.action = na.exclude)
summary(sp_ws_democontrol)
sp_ws$production_democontrol <- residuals(sp_ws_democontrol)

# Control for age as well, because the trends are not linear; try that model
# Build a model that includes age

# Plot the new controlled dataset
ggplot(sp_ws, aes(x = age, y = production_democontrol)) +
  geom_jitter(width = .2, alpha = .2) +
  geom_smooth() +
  facet_wrap(~language)

# If we control for caregiver_education, we won't have data for Peruvian Spanish
# The second option is to leave caregiver_education out

### New controlling version without caregiver_education
# Add the Peruvian data points again
# New binding
sp_ws <- bind_rows(eu_ws,
                   mx_ws,
                   pr_ws,
                   ar_ws)

# With the new data, control for demographics
sp_ws_democontrol2 <- lm(production ~ sex,
                         data = sp_ws,
                         na.action = na.exclude)
summary(sp_ws_democontrol2)
sp_ws$production_democontrol2 <- residuals(sp_ws_democontrol2)

# Plot the new controlled dataset
ggplot(sp_ws, aes(x = age, y = production_democontrol2)) +
  geom_jitter(width = .2, alpha = .2) +
  geom_smooth() +
  facet_wrap(~language)
```
* Do the sumscores across the intersecting items look the same at the population level (correcting for demographics)?
There are no differences between the uncontrolled data and the data controlled for sex (the only variable that could be controlled for). This conclusion is based on visual inspection only.
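The comments in the chunk above note that the age trends are not linear and suggest trying a model that includes age. One possible version of that model is sketched below on simulated data. Everything in it is an assumption for illustration: the data frame `toy`, the simulated growth curve, and the choice of a quadratic polynomial in age (a spline or a logistic curve might fit real CDI data better).

```r
set.seed(1)
n <- 300

# Simulated administrations (illustrative values only, not Wordbank data).
toy <- data.frame(
  age = sample(16:30, n, replace = TRUE),
  sex = sample(c("Female", "Male"), n, replace = TRUE)
)
toy$sex[c(10, 20, 30)] <- NA  # a few children with missing sex

# S-shaped growth in vocabulary with age, plus noise.
toy$production <- round(680 * plogis((toy$age - 24) / 3) + rnorm(n, sd = 40))

# Age enters as a quadratic polynomial to allow for nonlinear growth.
# na.exclude pads the residuals with NA for rows dropped by the model,
# so they line up with the rows of `toy`.
toy_fit <- lm(production ~ poly(age, 2) + sex,
              data = toy,
              na.action = na.exclude)
toy$production_resid <- residuals(toy_fit)

summary(toy_fit)
```

Using `data =` with bare column names, rather than `toy$production ~ toy$sex`, also makes it possible to call `predict()` on new data later.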
# Comparison of developmental properties
QUESTION: Do individual common items have similar psychometric / developmental properties?
```{r}
source("scripts/fit_models.R")

wb_data <- d_ws |>
  filter(uni_lemma %in% intersection) |>
  group_by(uni_lemma, language, age) |>
  summarise(total = n(),
            num_true = sum(produces, na.rm = TRUE))

aoas <- fit_aoas(wb_data)
```
```{r}
ggplot(data = aoas) +
  geom_histogram(aes(x = aoa)) +
  facet_wrap(~language)
```
According to the predictive model, the development of unilemmas is similar across the populations studied (Spanish dialects), controlling for age (that is, independent of age).
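`fit_aoas()` comes from `scripts/fit_models.R`, which is not shown here, so its exact method is not known from this notebook. One common way to estimate an age of acquisition from counts like `total` and `num_true` is to fit a logistic regression of the proportion of children producing the word on age and take the age at which the fitted probability reaches 0.5. The sketch below does this for a single hypothetical item; `toy_counts` and its values are invented, and the real function may work differently.

```r
# Hypothetical counts for one item in one language:
# number of children tested and number producing the word, by age in months.
toy_counts <- data.frame(age = 16:30, total = 40)
toy_counts$num_true <- round(toy_counts$total *
                               plogis((toy_counts$age - 22) / 2))

# Logistic regression of the proportion producing the word on age.
toy_glm <- glm(cbind(num_true, total - num_true) ~ age,
               family = binomial,
               data = toy_counts)

# The fitted probability is 0.5 where intercept + slope * age = 0,
# so the estimated AoA is -intercept / slope.
toy_aoa <- unname(-coef(toy_glm)[1] / coef(toy_glm)[2])
toy_aoa
```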
Calculate correlations in AoA between languages.
```{r}
cor_data <- aoas |>
  pivot_wider(id_cols = uni_lemma,
              names_from = language,
              values_from = aoa) |>
  ungroup()

correl <- cor(cor_data |> dplyr::select(-"uni_lemma"), use = "complete.obs")
```
Plot correlogram
```{r}
library(corrgram)
corrgram(correl, type = "cor", panel = panel.cor)
```
The most similar pair is Argentinian and Peruvian Spanish, and the least similar pair is Mexican and Peruvian Spanish.
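When reading these correlations it helps to keep in mind what `use = "complete.obs"` does: any uni_lemma with a missing AoA in at least one language is dropped from every pairwise correlation. A small base R sketch with invented values (`toy_aoas`, languages "A", "B", "C") shows the effect.

```r
# Hypothetical AoA estimates in long format:
# one row per uni_lemma per language ("ball" is missing in language C).
toy_aoas <- data.frame(
  uni_lemma = rep(c("dog", "cat", "ball", "milk"), times = 3),
  language  = rep(c("A", "B", "C"), each = 4),
  aoa       = c(18, 20, 19, 22,
                19, 21, 18, 23,
                20, 19, NA, 24)
)

# Reshape to wide: one row per uni_lemma, one AoA column per language.
toy_wide <- reshape(toy_aoas, idvar = "uni_lemma", timevar = "language",
                    direction = "wide")

# complete.obs: "ball" is removed from all pairs, including A vs. B.
toy_cor <- cor(toy_wide[, -1], use = "complete.obs")

# pairwise.complete.obs: each pair uses every row available for that pair.
toy_cor_pairwise <- cor(toy_wide[, -1], use = "pairwise.complete.obs")

toy_cor
toy_cor_pairwise
```

In this toy example the A-B correlation differs between the two matrices, because "ball" is available for that pair but is excluded under `complete.obs`. If many AoA estimates are missing in one dialect, the two options can give noticeably different pictures.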
# What are the properties of items with the same unilemma but different item definitions?