data-explorations.Rmd

---
title: "Spanish CDI explorations"
author: "Paulina & Mike"
date: "2022-10-13"
output: html_document
---

# Intro

Goal of this project is to use the five Spanish CDI datasets in Wordbank to try and investigate dialect variation in Spanish language acquisition. 

Here are some potential questions:

* Do the sumscores across the intersecting items look the same at the population level (correcting for demographics)?
* Do individual common items have similar psychometric / developmental properties?
* What are the properties of items with the same unilemma but different item definitions?
* What are the properties of items that are not shared across dialects? (Could also compare Spanish (Mexican) between monolingual and bilingual populations)


```{r setup}
library(tidyverse)
library(wordbankr)
library(arm)
```

# Data loading

Start with summary scores. 

```{r}
eu_ws <- get_administration_data(language = "Spanish (European)", 
                                 form = "WS", 
                                 include_demographic_info = TRUE)
mx_ws <- get_administration_data(language = "Spanish (Mexican)", 
                                 form = "WS", 
                                 include_demographic_info = TRUE)
pr_ws <- get_administration_data(language = "Spanish (Peruvian)", 
                                 form = "WS", 
                                 include_demographic_info = TRUE)
ar_ws <- get_administration_data(language = "Spanish (Argentinian)", 
                                 form = "WS", 
                                 include_demographic_info = TRUE)

sp_ws <- bind_rows(eu_ws, 
                   mx_ws, 
                   pr_ws, 
                   ar_ws)
```

Make a plot!

```{r}
ggplot(sp_ws, aes(x = age, y = production)) + 
  geom_jitter(width = .2, alpha = .2) + 
  geom_smooth() + 
  facet_wrap(~language)
```

# Comparison on the intersection of items

QUESTION: Do the sumscores across the intersecting items look the same at the population level (correcting for demographics)?

```{r}
langs <- c("Spanish (European)", "Spanish (Mexican)", "Spanish (Peruvian)", "Spanish (Argentinian)")

d_ws <- map_df(langs, function(x) get_instrument_data(language = x, 
                                                      form = "WS", 
                                                      administration_info = TRUE, 
                                                      item_info = TRUE))
```

Find the overlapping unilemmas.

For now, pull those unilemmas that are: 
1) in all languages, 
2) only once in each language. 

```{r}
items <- map_df(langs, function(x) get_item_data(language = x, form = "WS"))

intersection <- items |>
  group_by(uni_lemma) |>
  summarise(n_langs = length(unique(language)), 
            n = n()) |>
  filter(n_langs == 4, n == 4) |>
  pull(uni_lemma)
# 224 common items
``` 

Filter data and replot. 

```{r}
ms_ws <- d_ws |>
  filter(uni_lemma %in% intersection) |>
  group_by(child_id, language, age) |>
  summarise(produces = sum(produces))

ggplot(ms_ws, aes(x = age, y = produces)) + 
  geom_jitter(width = .2, alpha = .2) + 
  geom_smooth() + 
  facet_wrap(~language)
```


# Comparison on the intersection of items
Controlling for demographics

```{r}
summary(sp_ws)
summary(ar_ws)
summary(pr_ws)
summary(eu_ws)
summary(mx_ws)
# Hoff and Marchman has no data for production 
# 536 missing data points

# Identify the rows with missing data
which(is.na(mx_ws$production))
mx_ws <- mx_ws[-which(is.na(mx_ws$production)),]

# new binding
sp_ws <- bind_rows(eu_ws, 
                   mx_ws, 
                   pr_ws, 
                   ar_ws)
# Identify rows without caregiver_education (19X missing data points)
sp_ws <- sp_ws[-which(is.na(sp_ws$caregiver_education)),]

# with the new data controlic for demographics
sp_ws_democontrol <- lm(sp_ws$production~
                          sp_ws$caregiver_education+
                          sp_ws$sex)

summary(sp_ws_democontrol)
sp_ws$production_democontrol <- residuals(sp_ws_democontrol)

# controlar con la edad porque no son lineales, probar ese modelo
# Hacer un modelo que incluya tenga la edad dentro

#Plot the new controlled dataset
ggplot(sp_ws, aes(x = age, y = production_democontrol)) + 
  geom_jitter(width = .2, alpha = .2) + 
  geom_smooth() + 
  facet_wrap(~language)

#If we consider caregiver_education for controlling, then we won't have data for Peruvian Spanish
#Second option is to not taking into a consideration caregiver_education

### New controlling version without caregiver_education
# add the Peruvian data points again
# new binding
sp_ws <- bind_rows(eu_ws, 
                   mx_ws, 
                   pr_ws, 
                   ar_ws)
# with the new data controlic for demographics
sp_ws_democontrol2 <- lm(sp_ws$production~
                          sp_ws$sex)

summary(sp_ws_democontrol2)
sp_ws$production_democontrol2 <- residuals(sp_ws_democontrol2)

#Plot the new controlled dataset
ggplot(sp_ws, aes(x = age, y = production_democontrol2)) + 
  geom_jitter(width = .2, alpha = .2) + 
  geom_smooth() + 
  facet_wrap(~language)

```

* Do the sumscores across the intersecting items look the same at the population level (correcting for demographics)?
No hay diferencias entre la data sin control versus la data controlado por sexo (unica variable posible de controlar). Esto de una manera visual.


# Comparison of developmental properties 

QUESTION: Do individual common items have similar psychometric / developmental properties?

```{r}
source("scripts/fit_models.R")

wb_data <- d_ws |> 
  filter(uni_lemma %in% intersection) |>
  group_by(uni_lemma, language, age) |> 
  summarise(total = n(),
            num_true = sum(produces, na.rm = TRUE))

aoas <- fit_aoas(wb_data)
```

```{r}
ggplot(data = aoas) +
  geom_histogram(aes(x = aoa))+ 
  facet_wrap(~language)
```

De acuerdo al modelo predictivo el desarrollo de unilemas para cada una de las poblaciones estudiadas (dialectos del espanhol) es similar (controlado por edad; independiente de la edad).

Calculate correlations between languages

```{r}
cor_data <- aoas |> 
  pivot_wider(id_cols = uni_lemma, 
              names_from = language,
              values_from = aoa) |> 
  ungroup()

correl <- cor(cor_data |> dplyr::select(-"uni_lemma"), use = "complete.obs")
```

Plot correlogram
```{r}
library(corrgram)
corrgram(correl, type = "cor", panel = panel.cor)
```

El mas parecido es el espanhol argentino con el peruano y los menos parecido es el espanhol mexicano con el peruano.

#  What are the properties of items with the same unilemma but different item definitions?