This repository has been archived by the owner on Jan 29, 2021. It is now read-only.
---
title: "Cluster Analysis"
author: "Daniel Ebbert"
date: "`r format(Sys.Date(), '%d.%m.%Y')`"
output: html_document
bibliography: Method.bib
---
```{r setup, include = FALSE}
# Set the knitr options
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE,
fig.path = "fig/")
# Load the required libraries
library(cluster)
library(dplyr)
library(labelled)
library(knitr)
library(tibble)
library(stringr)
library(haven)
library(anchors)
library(ggplot2)
library(scales)
library(reshape2)
library(forcats)
library(factoextra)
library(fpc)
library(conover.test)
library(chisq.posthoc.test)
# Source the self-written functions for this project and for the data submodule
source(paste(getwd(), "/R/00_functions.R", sep = ""))
source(paste(getwd(), "/data/R/functions.R", sep = ""))
# Set seed for reproducibility
set.seed(5446)
```
## Load the data
First the original data is loaded, merged and cleaned.
```{r load_data}
# Merge and clean the datasets
source(paste(getwd(), "/R/01_merge-datasets.R", sep = ""))
```
## Average age and gender distribution
The average age in the full data set is: `r round(overall_average_age, digits = 2)`
The standard deviation of age in the full data set is: `r round(overall_sd_age, digits = 2)`
The gender distribution in the full data set is as follows:
```{r gender_distribution}
kable(gender_distribution)
```
## Distance matrix
Because the variables that enter the cluster analysis are not all continuous, a distance measure that can handle mixed variable types is needed. The distance measure used here is Gower distance [@Gower1971]. To run a cluster analysis with this distance measure, a distance matrix has to be computed first. Therefore, the next step is to create a distance matrix using Gower distance.
```{r gower}
# Create the distance matrix
source(paste(getwd(), "/R/02_gower-distance.R", sep = ""))
# Show a summary of the distance matrix
summary(gower_dist)
```
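The script `02_gower-distance.R` is not shown here, so the following is only a minimal sketch of how a Gower distance matrix can be computed with `daisy()` from the `cluster` package. The data frame `toy_df` and its columns are hypothetical and are not part of the project data.

```r
library(cluster)

# Hypothetical toy data with one numeric, one ordinal and one nominal variable
toy_df <- data.frame(
  hours  = c(2, 10, 4, 8),
  level  = factor(c("low", "high", "low", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE),
  device = factor(c("laptop", "tablet", "laptop", "phone"))
)

# daisy() with metric = "gower" handles mixed variable types
toy_gower <- daisy(toy_df, metric = "gower")

# Summary and full matrix of the pairwise dissimilarities
summary(toy_gower)
as.matrix(toy_gower)
```

Gower dissimilarities lie between 0 and 1, which makes them comparable across variables of different types.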
## Number of clusters
To determine the number of clusters, a cluster analysis using PAM is run for each k from 2 to 10. For each k the silhouette coefficient [@Rousseeuw1987] is calculated, and the results are plotted in a so-called silhouette plot.
```{r silhouette-plot}
# Run the cluster analysis for each k and create the silhouette plot
source(paste(getwd(), "/R/03_silhouette-plot.R", sep = ""))
kable(sil_width_df)
```
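As a rough illustration of the procedure described above, the sketch below fits PAM for k = 2 to 10 on a dissimilarity matrix and collects the average silhouette width for each k. The toy data and the names `toy_num`, `toy_dist` and `toy_sil_df` are assumptions made for this example; the actual implementation lives in `03_silhouette-plot.R` and may differ.

```r
library(cluster)

set.seed(1)
# Hypothetical toy data: three loose groups on two numeric variables
toy_num <- data.frame(
  x = c(rnorm(15, 0), rnorm(15, 4), rnorm(15, 8)),
  y = c(rnorm(15, 0), rnorm(15, 4), rnorm(15, 0))
)
toy_dist <- daisy(toy_num, metric = "gower")

# Fit PAM for each k and keep the average silhouette width
k_values <- 2:10
sil_width <- sapply(k_values, function(k) {
  pam(toy_dist, k = k, diss = TRUE)$silinfo$avg.width
})

toy_sil_df <- data.frame(k = k_values, sil_width = sil_width)
toy_sil_df
```

A higher average silhouette width indicates that observations sit closer to their own cluster than to the neighbouring one.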
The silhouette coefficient is highest at k = 3 and k = 5. Of these two, the solution with k = 5 clusters is the more meaningful one. Therefore, 5 clusters are extracted.
```{r set_n_clusters}
# Set the number of clusters
n_clusters <- 5
```
## Cluster analysis
Next, the cluster analysis is run. The method used is Partitioning Around Medoids (PAM) [@Kaufman1990].
```{r pam}
# Run the cluster analysis using PAM
source(paste(getwd(), "/R/04_pam.R", sep = ""))
```
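The following sketch shows, under assumed toy data, how a single PAM fit on a Gower dissimilarity matrix can be inspected. The objects `toy_df` and `toy_pam` are hypothetical; the project's own fit is produced by `04_pam.R`.

```r
library(cluster)

# Hypothetical toy data with two clearly separated groups
toy_df <- data.frame(
  score = c(1, 2, 1.5, 9, 10, 9.5),
  group = factor(c("a", "a", "a", "b", "b", "b"))
)
toy_dist <- daisy(toy_df, metric = "gower")

# Fit PAM with two clusters on the dissimilarity matrix
toy_pam <- pam(toy_dist, k = 2, diss = TRUE)

# Cluster membership for every observation
toy_df$cluster <- toy_pam$clustering

# id.med holds the row indices of the medoids, the most central observations
toy_df[toy_pam$id.med, ]
```

Because the medoids are actual observations, they can be read as a typical case for each cluster.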
## Aggregate clustering variables
In their original order the clusters are confusing for the following analysis and interpretation. Therefore, the clusters are reordered into a meaningful order.
```{r cluster_order}
# Change the order of the clusters
source(paste(getwd(), "/R/05_change_cluster_order.R", sep = ""))
```
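One simple way to relabel clusters is a lookup vector that maps old labels to new ones. The sketch below uses a made-up mapping and made-up labels; the mapping actually applied is defined in `05_change_cluster_order.R` and is not reproduced here.

```r
# Hypothetical cluster labels as returned by a clustering step
old_cluster <- c(3, 1, 2, 3, 1)

# Assumed mapping: names are the old labels, values are the new labels
new_label <- c("1" = 2, "2" = 3, "3" = 1)

# Look up the new label for every observation
new_cluster <- unname(new_label[as.character(old_cluster)])
new_cluster
```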
In this step the data that went into the cluster analysis is aggregated by cluster and shown in a table:
```{r results_aggregate}
# Aggregate the data for each cluster
source(paste(getwd(), "/R/06_aggregate.R", sep = ""))
# Show the data for each cluster in a table
kable(cluster_results, row.names = FALSE)
```
## Test for significant differences on each cluster variable
This step tests for a significant difference between the clusters on each cluster variable. For continuous variables the Kruskal-Wallis Rank Sum Test [@Kruskal1952] is used, and for categorical variables Pearson's Chi-squared Test for Count Data [@Pearson1900] with Monte Carlo simulation [@Hope1968] is used. The number of repetitions used in the Monte Carlo simulation is 8000 [@Mundform2011].
```{r cluster_variables_significance}
# Test for significant difference on each cluster variable
source(paste(getwd(), "/R/07_tests_for_difference_in_cluster_variables.R", sep = ""))
# Show the test result for each variable in a table
kable(difference_tests_summary_cluster_variables)
```
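The two omnibus tests named above are available in base R. The sketch below applies them to hypothetical toy data (`toy`, with made-up columns `score` and `answer`); it illustrates the calls only and is not the content of `07_tests_for_difference_in_cluster_variables.R`.

```r
set.seed(1)
# Hypothetical toy data: one continuous and one categorical variable per cluster
toy <- data.frame(
  cluster = factor(rep(1:3, each = 20)),
  score   = c(rnorm(20, 0), rnorm(20, 1), rnorm(20, 3)),
  answer  = factor(sample(c("yes", "no"), 60, replace = TRUE))
)

# Continuous variable: Kruskal-Wallis rank sum test
kw <- kruskal.test(score ~ cluster, data = toy)

# Categorical variable: chi-squared test with a Monte Carlo p value
chi <- chisq.test(table(toy$answer, toy$cluster),
                  simulate.p.value = TRUE, B = 8000)

# Collect both p values in one table
data.frame(test    = c("Kruskal-Wallis", "Chi-squared (Monte Carlo)"),
           p_value = c(kw$p.value, chi$p.value))
```

With `simulate.p.value = TRUE` the p value is estimated from `B` simulated tables instead of the chi-squared approximation, which is useful when some expected cell counts are small.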
For the variables where the difference between the clusters is significant, a post-hoc test is conducted. For continuous variables the post-hoc test used is the Conover-Iman test [@Conover1979], and for categorical variables the residuals from Pearson's Chi-squared Test for Count Data with Monte Carlo simulation are used [@Beasley1995]. Based on the residuals a p value is calculated. For this process a self-written R package called chisq.posthoc.test is used. In both cases the p values are adjusted using the Bonferroni method [@Dunn1961].
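The residual-based post-hoc idea can be sketched in base R as follows. This is an assumption about the general approach, not a copy of what `chisq.posthoc.test` does internally, and the contingency table `toy_tab` is made up.

```r
# Hypothetical contingency table: answer categories (rows) by cluster (columns)
toy_tab <- as.table(matrix(c(30, 10, 10,
                             10, 30, 10,
                             10, 10, 30),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(answer  = c("a", "b", "c"),
                                           cluster = c("1", "2", "3"))))

# Standardized residuals of the chi-squared test
std_res <- chisq.test(toy_tab)$stdres

# Two-sided p value per cell, treating each residual as a z score
p_raw <- 2 * pnorm(-abs(std_res))

# Bonferroni adjustment over all cells of the table
p_adj <- p_raw
p_adj[] <- p.adjust(as.vector(p_raw), method = "bonferroni")

round(std_res, 2)
round(p_adj, 4)
```

Cells with a large absolute residual and a small adjusted p value are the ones that deviate most from what independence between answer and cluster would predict.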
**Conover-Iman test for: `r cluster_results$Variable[1]`**
```{r difference_test_Q5_1}
# Conover-Iman test to determine which of the clusters differ.
conover.test(cluster_data$Q5_1,
cluster_data$cluster,
method = "bonferroni",
altp = TRUE)
```
**Pearson's Chi-squared Test for Count Data with Monte Carlo simulation residuals analysis for: `r cluster_results$Variable[2]`**
```{r difference_test_Q5_2}
# Run a posthoc test based on the residuals to determine which clusters differ
kable(chisq.posthoc.test(
table(cluster_data$Q5_2, cluster_data$cluster),
method = "bonferroni",
simulate.p.value = TRUE,
B = 8000
))
```
**Pearson's Chi-squared Test for Count Data with Monte Carlo simulation residuals analysis for: `r cluster_results$Variable[3]`**
```{r difference_test_Q5_4}
# Run a posthoc test based on the residuals to determine which clusters differ
kable(chisq.posthoc.test(
table(cluster_data$Q5_4, cluster_data$cluster),
method = "bonferroni",
simulate.p.value = TRUE,
B = 8000
))
```
**Conover-Iman test for: `r cluster_results$Variable[4]`**
```{r difference_test_Q5_6}
# Conover-Iman test to determine which of the clusters differ.
conover.test(cluster_data$Q5_6,
cluster_data$cluster,
method = "bonferroni",
altp = TRUE)
```
**Conover-Iman test for: `r cluster_results$Variable[5]`**
```{r difference_test_Q6_3}
# Conover-Iman test to determine which of the clusters differ.
conover.test(cluster_data$Q6_3,
cluster_data$cluster,
method = "bonferroni",
altp = TRUE)
```
## Semester distribution
In this step the distribution of the students in the clusters over the semesters in which the data was gathered is shown.
```{r semester_count}
# Count the number of students by semester and by cluster
source(paste(getwd(), "/R/08_semester_count.R", sep = ""))
# Show the results in a table
kable(semester_count, row.names = FALSE)
```
Test whether any cluster is significantly more present in one of the semesters during which the data was gathered, using Pearson's Chi-squared Test for Count Data.
```{r semester_count_sig_distribution}
# Pearson's Chi-squared test for the distribution of the semesters over the clusters
chisq.test(subset_df$Semester, subset_df$cluster)
```
## Aggregate dependent variables
In this step some variables from the original data set are added and aggregated by cluster to aid the interpretation.
```{r dependent}
# Aggregate the additional variables for the interpretation
source(paste(getwd(), "/R/09_filter.R", sep = ""))
# Show the data for each cluster in a table
kable(dependent_df, row.names = FALSE)
```
## Test for significant difference in dependent variables
This step tests for a significant difference between the clusters on each dependent variable. For continuous variables the Kruskal-Wallis Rank Sum Test is used, and for categorical variables Pearson's Chi-squared Test for Count Data with Monte Carlo simulation is used.
```{r dependent_variables_significance}
# Test for significant difference on each dependent variable
source(paste(getwd(), "/R/10_tests_for_difference_in_dependent_variables.R", sep = ""))
# Show the test result for each variable in a table
kable(difference_tests_summary_dependent_variables)
```
For the variables where the difference between the clusters is significant, a post-hoc test is conducted. For continuous variables the post-hoc test used is the Conover-Iman test, and for categorical variables the residuals from Pearson's Chi-squared Test for Count Data with Monte Carlo simulation are used. Based on the residuals a p value is calculated. For this process a self-written R package called chisq.posthoc.test is used. In both cases the p values are adjusted using the Bonferroni method.
**Conover-Iman test for: `r dependent_df$Variable[2]`**
```{r difference_test_Q2_2}
# Conover-Iman test to determine which of the clusters differ.
conover.test(
dependent_df_raw$Q2_2,
dependent_df_raw$cluster,
method = "bonferroni",
altp = TRUE
)
```
**Pearson's Chi-squared Test for Count Data with Monte Carlo simulation residuals analysis for: `r dependent_df$Variable[3]`**
```{r difference_test_Q3_1}
# Run a posthoc test based on the residuals to determine which clusters differ
kable(chisq.posthoc.test(
table(dependent_df_raw$Q3_1, dependent_df_raw$cluster),
method = "bonferroni",
simulate.p.value = TRUE,
B = 8000
))
```
**Conover-Iman test for: `r dependent_df$Variable[5]`**
```{r difference_test_Q8_1}
# Conover-Iman test to determine which of the clusters differ.
conover.test(
as.numeric(dependent_df_raw$Q8_1),
dependent_df_raw$cluster,
method = "bonferroni",
altp = TRUE
)
```
**Compare the answers about the learning process by cluster**
```{r lernprozess}
# Get the answers about the learning process per cluster
source(paste(getwd(), "/R/11_proces.R", sep = ""))
# Show the data about the learning process for each cluster in a table
kable(learn_df, row.names = FALSE)
```
**Test significance for learning process variables**
```{r learning_variables_significance}
# Test for significant difference on each learning variable
source(paste(getwd(), "/R/12_tests_for_difference_in_learning_variables.R", sep = ""))
# Show the test result for each variable in a table
kable(difference_tests_summary_learning_variables)
```
**Pearson's Chi-squared Test for Count Data with Monte Carlo simulation residuals analysis for: `r learn_df$Variable[1]`**
```{r difference_test_Q6_1_1}
# Run a posthoc test based on the residuals to determine which clusters differ
kable(chisq.posthoc.test(
table(learn_df_raw$Q6_1_1, learn_df_raw$cluster),
method = "bonferroni",
simulate.p.value = TRUE,
B = 8000
))
```
**Pearson's Chi-squared Test for Count Data with Monte Carlo simulation residuals analysis for: `r learn_df$Variable[2]`**
```{r difference_test_Q6_1_2}
# Run a posthoc test based on the residuals to determine which clusters differ
kable(chisq.posthoc.test(
table(learn_df_raw$Q6_1_2, learn_df_raw$cluster),
method = "bonferroni",
simulate.p.value = TRUE,
B = 8000
))
```
**Pearson's Chi-squared Test for Count Data with Monte Carlo simulation residuals analysis for: `r learn_df$Variable[3]`**
```{r difference_test_Q6_1_3}
# Run a posthoc test based on the residuals to determine which clusters differ
kable(chisq.posthoc.test(
table(learn_df_raw$Q6_1_3, learn_df_raw$cluster),
method = "bonferroni",
simulate.p.value = TRUE,
B = 8000
))
```
**Pearson's Chi-squared Test for Count Data with Monte Carlo simulation residuals analysis for: `r learn_df$Variable[5]`**
```{r difference_test_Q6_1_5}
# Run a posthoc test based on the residuals to determine which clusters differ
kable(chisq.posthoc.test(
table(learn_df_raw$Q6_1_5, learn_df_raw$cluster),
method = "bonferroni",
simulate.p.value = TRUE,
B = 8000
))
```
# Session Info
Show the session info to state which packages have been used for the analysis.
```{r session}
# Show the session info
sessionInfo()
```
# References