---
title: "Cross-tabulation of clusters"
description: |
  Chapter 5.3 Cross-tabulation of groups from different dissimilarity matrices
output: distill::distill_article
---

```{r setup, include=FALSE}

# Load required packages
library(here)
source(here("source", "load_libraries.R"))

# Output options
knitr::opts_chunk$set(eval=TRUE, echo=TRUE)
options("kableExtra.html.bsTable" = TRUE)

# load data for Chapter 5
load(here("data", "5-0_ChapterSetup.RData"))

```


```{r, xaringanExtra-clipboard, echo=FALSE}
htmltools::tagList(
  xaringanExtra::use_clipboard(
    button_text = "<i class=\"fa fa-clone fa-2x\" style=\"color: #301e64\"></i>",
    success_text = "<i class=\"fa fa-check fa-2x\" style=\"color: #90BE6D\"></i>",
    error_text = "<i class=\"fa fa-times fa-2x\" style=\"color: #F94144\"></i>"
  ),
  rmarkdown::html_dependency_font_awesome()
)
```

<details><summary>**Click here to get instructions...**</summary>

- Please download and unzip the replication files for Chapter 5
([`r fontawesome::fa("far fa-file-zipper")` Chapter05.zip](source/Chapter05.zip)). 
- Read `readme.html` and run `5-0_ChapterSetup.R`. This will create `5-0_ChapterSetup.RData` in the subfolder `data/R`. This file contains the data required to reproduce the results shown below.
- You also have to add the function `legend_large_box` to your environment in order to render the tweaked version of the legend described below. You can find this file in the `source` folder of the unzipped Chapter 5 archive.
- We also recommend loading the libraries listed in Chapter 5's `LoadInstallPackages.R`.

```{r, eval=FALSE}
# assuming you are working within .Rproj environment
library(here)

# install (if necessary) and load other required packages
source(here("source", "load_libraries.R"))

# load environment generated in "5-0_ChapterSetup.R"
load(here("data", "R", "5-0_ChapterSetup.RData"))

```
</details>

\

In Chapter 5.3, we introduce one option for accounting for the parallel unfolding of temporal processes: the cross-tabulation of cluster solutions extracted separately from two (or more) pools of sequences that represent trajectories in different domains. We now use the `data.frame` `multidim`, which contains both family formation and labor market sequences. The data come from a sub-sample of the German Family Panel (pairfam). For further information on the study and on how to access the full scientific use file, see [here](https://www.pairfam.de/en/){target="_blank"}.

## Preparatory work for family formation trajectories

First, we run a Ward cluster analysis based on the dissimilarity matrix `mc.fam.year.om`:

```{r, eval=TRUE, echo=TRUE}
fam.ward <- hclust(as.dist(mc.fam.year.om),
                   method = "ward.D",
                   members = multidim$weight40)
```

...to be used as the initialization of the PAM clustering:

```{r, eval=TRUE, echo=TRUE}
fam.pam <- wcKMedRange(mc.fam.year.om,
                       weights = multidim$weight40,
                       kvals = 2:10,
                       initialclust = fam.ward)
```
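
The idea of seeding PAM with a hierarchical solution can be illustrated on toy data. The following is a minimal sketch in base R, not the internals of `wcKMedRange`: it assumes that the Ward tree is cut into `k` groups and that the weighted medoid of each group serves as a starting point. The objects `x`, `w`, `init`, and `medoid_of` are hypothetical and invented for illustration.

```r
# Toy data (hypothetical, not pairfam): 8 points forming two obvious groups
x <- c(1, 2, 3, 4, 20, 21, 22, 23)
w <- rep(1, length(x))            # hypothetical unit weights
d <- as.matrix(dist(x))           # pairwise dissimilarities

# Step 1: Ward clustering, then cut the tree into k = 2 groups
ward <- hclust(as.dist(d), method = "ward.D", members = w)
init <- cutree(ward, k = 2)

# Step 2: weighted medoid of a group = the member with the smallest
# weighted sum of dissimilarities to all members of that group
medoid_of <- function(idx) {
  idx[which.min(colSums(w[idx] * d[idx, idx, drop = FALSE]))]
}
start_medoids <- sapply(split(seq_along(x), init), medoid_of)
start_medoids
```

In this toy example the starting medoids are observations 2 and 6, one in each of the two well-separated groups. A PAM routine would then refine these medoids iteratively.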

We now extract 5 clusters...

```{r, eval=TRUE, echo=TRUE}
fam.pam.5cl <- fam.pam$clustering$cluster5
```

...attach the cluster info to the main `data.frame` `multidim`...

```{r, eval=TRUE, echo=TRUE}
multidim$fam.pam.5cl <- fam.pam.5cl
```

...and relabel the clusters from 1 to 5, replacing the medoid identifiers...

```{r, eval=TRUE, echo=TRUE}
fam.pam.5cl.factor <- factor(fam.pam.5cl,
                             levels = c(16, 460, 479, 892, 898),
                             labels = c("1", "2", "3", "4", "5"))
```

...to finally attach the factor info to the main `data.frame` `multidim`:

```{r, eval=TRUE, echo=TRUE}
multidim$fam.pam.5cl.factor <- fam.pam.5cl.factor
```
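
Hard-coded medoid identifiers only work for the exact data and settings used here; if the medoids change, unmatched values silently become `NA`. As a hedged alternative, the levels can be derived from the cluster vector itself. The sketch below uses a hypothetical toy vector `cl` (not the pairfam result) and assumes that the clusters should be numbered in ascending order of their medoid identifiers.

```r
# Hypothetical cluster vector holding medoid identifiers
cl <- c(16, 460, 16, 898, 460, 16)

# Derive the levels from the data instead of typing them by hand
ids <- sort(unique(cl))
cl.factor <- factor(cl, levels = ids, labels = seq_along(ids))

# Check that no observation was lost in the relabeling
stopifnot(!anyNA(cl.factor))

# Cross-check the mapping between medoid identifiers and new labels
table(cl, cl.factor)
```

The cross-check table should have exactly one non-zero cell per row, which confirms that each medoid identifier maps to a single label.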

## Preparatory work for labor market trajectories

First, we run a Ward cluster analysis based on the dissimilarity matrix `mc.act.year.om`:

```{r, eval=TRUE, echo=TRUE}
act.ward <- hclust(as.dist(mc.act.year.om),
                   method = "ward.D",
                   members = multidim$weight40)
```

...to be used as the initialization of the PAM clustering:

```{r, eval=TRUE, echo=TRUE}
act.pam <- wcKMedRange(mc.act.year.om, 
                       weights = multidim$weight40, 
                       kvals = 2:10,
                       initialclust = act.ward)
```

We now extract 5 clusters...

```{r, eval=TRUE, echo=TRUE}
act.pam.5cl <- act.pam$clustering$cluster5
```

...attach the cluster info to the main `data.frame` `multidim`...

```{r, eval=TRUE, echo=TRUE}
multidim$act.pam.5cl <- act.pam.5cl
```

...and relabel the clusters from 1 to 5, replacing the medoid identifiers...

```{r, eval=TRUE, echo=TRUE}
act.pam.5cl.factor <- factor(act.pam.5cl,
                             levels = c(6, 25, 78, 539, 709),
                             labels = c("1", "2", "3", "4", "5"))
```

...to finally attach the factor info to the main `data.frame` `multidim`:

```{r, eval=TRUE, echo=TRUE}
multidim$act.pam.5cl.factor <- act.pam.5cl.factor
```

## Cross-tabulation for a 5-cluster solution on both channels 

We tabulate the two cluster vectors (labor market clusters in rows, family formation clusters in columns) and store the result in an object that we name `crosstab`...

```{r, eval=TRUE, echo=TRUE}
crosstab <- table(multidim$act.pam.5cl.factor, multidim$fam.pam.5cl.factor)
```

...to print it at our convenience:

```{r, eval=TRUE, echo=TRUE}
crosstab
```
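
A raw table of counts is often easier to read with labeled dimensions, margins, and row proportions. The sketch below shows these base R options on a hypothetical toy `data.frame` named `toy` (its values are invented and do not come from pairfam). It also shows a weighted variant with `xtabs`, under the assumption that the weights used for clustering should be reflected in the table as well.

```r
# Hypothetical toy data standing in for multidim
toy <- data.frame(
  act = factor(c("1", "1", "2", "2", "2", "1"), levels = c("1", "2")),
  fam = factor(c("1", "2", "1", "2", "2", "1"), levels = c("1", "2")),
  w   = c(1, 2, 1, 1, 2, 1)   # hypothetical weights
)

# Unweighted cross-tabulation with labeled dimensions
tab <- table(toy$act, toy$fam, dnn = c("Labor market", "Family"))

# Add row and column totals
addmargins(tab)

# Row proportions: distribution of family clusters within each labor market cluster
round(prop.table(tab, margin = 1), 2)

# Weighted cross-tabulation: cells sum the weights instead of counting cases
wtab <- xtabs(w ~ act + fam, data = toy)
wtab
```

With `margin = 2`, `prop.table` returns column proportions instead, which answers the reverse question of how labor market clusters are distributed within each family formation cluster.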