\relax
\providecommand\hyper@newdestlabel[2]{}
\providecommand\HyField@AuxAddToFields[1]{}
\providecommand\HyField@AuxAddToCoFields[2]{}
\bibstyle{ajs}
\citation{leisch2018market}
\citation{ggbeeswarm}
\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces The ways we might expect a clustering algorithm to partition 2D data with different association structures. If the correlation is high (C), the partitioning happens along the primary direction of the association.}}{2}{figure.1}\protected@file@percent }
\newlabel{kmeans-partition}{{1}{2}{The ways we might expect a clustering algorithm to partition 2D data with different association structures. If the correlation is high (C), the partitioning happens along the primary direction of the association}{figure.1}{}}
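As a minimal illustration of the behaviour in Figure 1, the following sketch runs $k$-means on strongly correlated 2D data (numpy and scikit-learn assumed; the data and parameters are hypothetical, not those used in the paper):
\begin{verbatim}
# k-means on strongly correlated 2D data partitions along the
# primary direction of the association (cf. Figure 1C).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cov = [[1.0, 0.9], [0.9, 1.0]]  # high correlation, as in case C
X = rng.multivariate_normal([0.0, 0.0], cov, size=500)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # centers spread along the main diagonal
\end{verbatim}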
\citation{Asimov1985-xr}
\citation{lee2022}
\citation{cook_manual_1997,laa2023new}
\citation{Laa2020}
\citation{leisch2018market}
\@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces The typical approach to understanding the data partitions is plotting individual features using histograms, where the group is mapped to color. Here, the three data sets displayed in Figure 1 are shown in the columns, and the rows correspond to the two variables, V1, V2. Differences between groups can be seen and interpreted on a single-variable level; for example, the red group has low values on both V1 and V2 in data set C. However, how the data is divided cannot be seen.}}{3}{figure.2}\protected@file@percent }
\newlabel{kmeans-histogram}{{2}{3}{The typical approach to understanding the data partitions is plotting individual features using histograms, where the group is mapped to color. Here, the three data sets displayed in Figure 1 are shown in the columns, and the rows correspond to the two variables, V1, V2. Differences between groups can be seen and interpreted on a single-variable level; for example, the red group has low values on both V1 and V2 in data set C. However, how the data is divided cannot be seen}{figure.2}{}}
\citation{tourr}
\citation{laa2023new}
\citation{RJ-2023-052}
\citation{ggobi}
\citation{lundh1999introduction}
\citation{schimansky24}
\citation{Hunter:2007}
\citation{tourr}
\citation{reticulate}
\@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces Examining partitions in high dimensions requires viewing linear combinations of the features, here for each of the data sets. The three plots show the views of the three data sets (A, B, C) revealing the partitioning. For data sets A and B, the red and orange groups have been faded out to focus on the blue and green groups, because there is no single linear combination where partitions between all groups are visible. All four group partitions can be seen for data set C, because there is a linear combination in which all partitions distinctly split the groups.}}{4}{figure.3}\protected@file@percent }
\newlabel{kmeans-linear-combinations}{{3}{4}{Examining partitions in high dimensions requires viewing linear combinations of the features, here for each of the data sets. The three plots show the views of the three data sets (A, B, C) revealing the partitioning. For data sets A and B, the red and orange groups have been faded out to focus on the blue and green groups, because there is no single linear combination where partitions between all groups are visible. All four group partitions can be seen for data set C, because there is a linear combination in which all partitions distinctly split the groups}{figure.3}{}}
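The linear combinations underlying Figure 3 can be sketched as follows (an illustrative projection helper, not the paper's implementation; numpy assumed and all names hypothetical):
\begin{verbatim}
# A 2D "view" of p-dimensional data is X @ A, where A is a p x 2
# matrix with orthonormal columns spanning the projection plane.
import numpy as np

def project(X, A):
    Q, _ = np.linalg.qr(A)  # orthonormalise the basis
    return X @ Q            # n x 2 projected coordinates

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))  # illustrative 5D data
A = rng.normal(size=(5, 2))    # a random projection plane
Y = project(X, A)              # coordinates for the scatterplot
\end{verbatim}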
\newlabel{interface}{{2}{4}{}{section.2}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces Overview of the GUI. A: Sidebar controls; B: Display area; C: Feature selection checkboxes; D: Subset selection with color indicators; E: Frame selection interface; F: Interfaces for adjusting the number of histogram bins, animating tours and hiding projection axes with a low norm; G: Save and load buttons; H: Interface for starting new tours; I: Metric selection interface.}}{5}{figure.4}\protected@file@percent }
\newlabel{fig:GUI_overview}{{4}{5}{Overview of the GUI. A: Sidebar controls; B: Display area; C: Feature selection checkboxes; D: Subset selection with color indicators; E: Frame selection interface; F: Interfaces for adjusting the number of histogram bins, animating tours and hiding projection axes with a low norm; G: Save and load buttons; H: Interface for starting new tours; I: Metric selection interface}{figure.4}{}}
\citation{ggobi}
\citation{python}
\citation{R}
\citation{CBCH94}
\citation{lckl2005}
\newlabel{workflow}{{3}{8}{}{section.3}{}}
\citation{keim2010mastering}
\citation{dolnicar2003winter}
\citation{cliff2009formative}
\citation{dolnicar2017peer}
\newlabel{applications}{{4}{10}{}{section.4}{}}
\citation{leisch2018market}
\citation{flexclust}
\@writefile{lof}{\contentsline {figure}{\numberline {5}{\ignorespaces Scatterplot of the features alpine skiing and cross-country skiing (A) and three 2D projections from a grand tour on the Austrian Vacation Activities data (B-D). Because the 27 features are binary, distinct groupings appear when few features dominate the projection (A-B), but the data mostly looks continuous when the loadings of the other features increase (C-D).}}{11}{figure.5}\protected@file@percent }
\newlabel{fig:winter-gt}{{5}{11}{Scatterplot of the features alpine skiing and cross-country skiing (A) and three 2D projections from a grand tour on the Austrian Vacation Activities data (B-D). Because the 27 features are binary, distinct groupings appear when few features dominate the projection (A-B), but the data mostly looks continuous when the loadings of the other features increase (C-D)}{figure.5}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {6}{\ignorespaces Traditional overview of clusters. Color represents (A) the intra-cluster fraction, and (B) the intra-feature fraction. From A, we can see that cluster 3 tourists like alpine skiing, going out in the evening and going to discos and bars. They also like relaxing, shopping and sightseeing, but these are popular among all tourists. From B, we can see the distribution of activities across clusters, e.g., most tourists who use health facilities are found in cluster 4, while tourists going to a pool or sauna are primarily found in clusters 1 and 5.}}{12}{figure.6}\protected@file@percent }
\newlabel{fig:winteractiv_heatmap}{{6}{12}{Traditional overview of clusters. Color represents (A) the intra-cluster fraction, and (B) the intra-feature fraction. From A, we can see that cluster 3 tourists like alpine skiing, going out in the evening and going to discos and bars. They also like relaxing, shopping and sightseeing, but these are popular among all tourists. From B, we can see the distribution of activities across clusters, e.g., most tourists who use health facilities are found in cluster 4, while tourists going to a pool or sauna are primarily found in clusters 1 and 5}{figure.6}{}}
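The two fractions shown in Figure 6 can be sketched as follows (pandas assumed; the names activities and cluster, and the data, are hypothetical):
\begin{verbatim}
# Intra-cluster vs. intra-feature fractions for binary activity data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
activities = pd.DataFrame(rng.integers(0, 2, size=(200, 4)),
                          columns=["alpine skiing", "relaxing",
                                   "shopping", "sightseeing"])
cluster = rng.integers(1, 7, size=200)  # illustrative labels 1..6

# Panel A: share of tourists within each cluster doing each activity.
intra_cluster = activities.groupby(cluster).mean()

# Panel B: share of tourists doing an activity found in each cluster.
counts = activities.groupby(cluster).sum()
intra_feature = counts / counts.sum(axis=0)
\end{verbatim}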
\@writefile{lof}{\contentsline {figure}{\numberline {7}{\ignorespaces Comparison of silhouette plots of two $k$-means cluster solutions of the Austrian vacation activities dataset with $k=6$. (A) shows the silhouette plot of the $k$-means solution on the full dataset and (B) on the dataset after manual feature selection. We can see that the cluster solution with the reduced dataset achieved better silhouette scores and that clusters 1 and 3 contain observations with negative silhouette scores.}}{13}{figure.7}\protected@file@percent }
\newlabel{fig:silhouette_comparison}{{7}{13}{Comparison of silhouette plots of two $k$-means cluster solutions of the Austrian vacation activities dataset with $k=6$. (A) shows the silhouette plot of the $k$-means solution on the full dataset and (B) on the dataset after manual feature selection. We can see that the cluster solution with the reduced dataset achieved better silhouette scores and that clusters 1 and 3 contain observations with negative silhouette scores}{figure.7}{}}
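For reference, the silhouette score compared in Figure 7 is defined, for observation $i$ with mean distance $a(i)$ to its own cluster and mean distance $b(i)$ to the nearest other cluster, as
\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1,
\]
so negative scores flag observations that are, on average, closer to another cluster than to their own.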
\@writefile{lof}{\contentsline {figure}{\numberline {8}{\ignorespaces Display of a projection of the Austrian vacation activities dataset with the six clusters of the $k$-means cluster solution highlighted in different colors. Projection axes with a norm < 0.1 are not shown, as these axes have little influence on the projection but reduce the clarity of the plot. The colors indicate which clusters the highlighted observations belong to, with cluster 1 being blue (A), cluster 2 being orange (B), cluster 3 being green (C), cluster 4 being red (D), cluster 5 being violet (E) and cluster 6 being brown (F). Some data points appear to be highlighted in every panel, which occurs because many data points overlap in one spot. We can see that the projection separates some clusters well, but also that there is considerable overlap of clusters 4, 5 and 6.}}{14}{figure.8}\protected@file@percent }
\newlabel{fig:winter_activ_cluster_highlights}{{8}{14}{Display of a projection of the Austrian vacation activities dataset with the six clusters of the $k$-means cluster solution highlighted in different colors. Projection axes with a norm < 0.1 are not shown, as these axes have little influence on the projection but reduce the clarity of the plot. The colors indicate which clusters the highlighted observations belong to, with cluster 1 being blue (A), cluster 2 being orange (B), cluster 3 being green (C), cluster 4 being red (D), cluster 5 being violet (E) and cluster 6 being brown (F). Some data points appear to be highlighted in every panel, which occurs because many data points overlap in one spot. We can see that the projection separates some clusters well, but also that there is considerable overlap of clusters 4, 5 and 6}{figure.8}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {9}{\ignorespaces Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Austrian vacation activities dataset. Projection axes with a norm < 0.1 are not shown, as these axes have little influence on the projection but reduce the clarity of the plot. Top left: 2D tour with the linear discriminant analysis projection pursuit index. Top right: heatmap with the intra-cluster fraction. Bottom left: 1D tour with the linear discriminant analysis projection pursuit index. Bottom right: mosaic plot. Tourists in clusters 1 and 3 rarely participated in skiing, but tourists in cluster 3 were much more interested in going to the pool, spa and health facilities than those in cluster 1.}}{15}{figure.9}\protected@file@percent }
\newlabel{fig:winter_cl7_init}{{9}{15}{Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Austrian vacation activities dataset. Projection axes with a norm < 0.1 are not shown, as these axes have little influence on the projection but reduce the clarity of the plot. Top left: 2D tour with the linear discriminant analysis projection pursuit index. Top right: heatmap with the intra-cluster fraction. Bottom left: 1D tour with the linear discriminant analysis projection pursuit index. Bottom right: mosaic plot. Tourists in clusters 1 and 3 rarely participated in skiing, but tourists in cluster 3 were much more interested in going to the pool, spa and health facilities than those in cluster 1}{figure.9}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {10}{\ignorespaces Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Austrian vacation activities dataset, with manually adjusted projections. Projection axes with a norm < 0.1 are not shown, as these axes have little influence on the projection but reduce the clarity of the plot. Top left: 2D tour with a manually adjusted projection. Top right: heatmap with the intra-cluster fraction. Bottom left: 1D tour with a manually adjusted projection. Bottom right: mosaic plot. Changing the projection axes of ``going to a spa'', ``museums'', ``sightseeing'' and ``using health facilities'' reveals the preferences and overlap of clusters 1 and 3.}}{16}{figure.10}\protected@file@percent }
\newlabel{fig:winter_cl7_pre}{{10}{16}{Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Austrian vacation activities dataset, with manually adjusted projections. Projection axes with a norm < 0.1 are not shown, as these axes have little influence on the projection but reduce the clarity of the plot. Top left: 2D tour with a manually adjusted projection. Top right: heatmap with the intra-cluster fraction. Bottom left: 1D tour with a manually adjusted projection. Bottom right: mosaic plot. Changing the projection axes of ``going to a spa'', ``museums'', ``sightseeing'' and ``using health facilities'' reveals the preferences and overlap of clusters 1 and 3}{figure.10}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {11}{\ignorespaces Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Austrian vacation activities dataset, after new clusters have been selected manually. Projection axes with a norm < 0.1 are not shown, as these axes have little influence on the projection but reduce the clarity of the plot. Top left: 2D tour with a manually adjusted projection. Top right: heatmap with the intra-cluster fraction. Bottom left: 1D tour with a manually adjusted projection. Bottom right: mosaic plot. We can see that almost all museum-goers are now in the new, manually selected cluster 7, and how their preferences compare to those of the other clusters.}}{17}{figure.11}\protected@file@percent }
\newlabel{fig:winter_cl7_post}{{11}{17}{Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Austrian vacation activities dataset, after new clusters have been selected manually. Projection axes with a norm < 0.1 are not shown, as these axes have little influence on the projection but reduce the clarity of the plot. Top left: 2D tour with a manually adjusted projection. Top right: heatmap with the intra-cluster fraction. Bottom left: 1D tour with a manually adjusted projection. Bottom right: mosaic plot. We can see that almost all museum-goers are now in the new, manually selected cluster 7, and how their preferences compare to those of the other clusters}{figure.11}{}}
\citation{murtagh2014ward}
\@writefile{lof}{\contentsline {figure}{\numberline {12}{\ignorespaces Dendrogram of the features of the Australian Vacation Activities dataset using the Ward2 algorithm with the Jaccard index. Features that were clustered together are marked by the colored boxes. We can see which activities had similar patterns of tourist interest.}}{18}{figure.12}\protected@file@percent }
\newlabel{fig:aus_feature_clustering}{{12}{18}{Dendrogram of the features of the Australian Vacation Activities dataset using the Ward2 algorithm with the Jaccard index. Features that were clustered together are marked by the colored boxes. We can see which activities had similar patterns of tourist interest}{figure.12}{}}
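A minimal sketch of this feature clustering (scipy assumed, data illustrative; R's hclust with method ward.D2 on Jaccard distances is the Ward2 algorithm of Murtagh and Legendre 2014, and scipy's ward method applies the analogous recurrence to the dissimilarities it is given, though it is only guaranteed exact for Euclidean input):
\begin{verbatim}
# Ward linkage on Jaccard distances between binary feature columns,
# as in Figure 12 (illustrative data, not the paper's).
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(3)
activities = pd.DataFrame(
    rng.integers(0, 2, size=(300, 6)).astype(bool),
    columns=[f"activity_{j}" for j in range(6)])

# Distances between the *columns* (features), not the rows.
D = pdist(activities.T.values, metric="jaccard")
Z = linkage(D, method="ward")
dendrogram(Z, labels=list(activities.columns))
\end{verbatim}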
\@writefile{lof}{\contentsline {figure}{\numberline {13}{\ignorespaces Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Australian vacation activities dataset, with the projection axes of the feature ``Friends'' pointing in one direction and all the others in another. Top left: 2D tour. Top right: 1D tour. Bottom left: Mosaic plot. Bottom right: Heatmap with the intra-cluster fraction. We can see how the feature ``Friends'' and the general activity level (represented by the other projection axes pointing in one direction) separate the data.}}{19}{figure.13}\protected@file@percent }
\newlabel{fig:aus_preselection}{{13}{19}{Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Australian vacation activities dataset, with the projection axes of the feature ``Friends'' pointing in one direction and all the others in another. Top left: 2D tour. Top right: 1D tour. Bottom left: Mosaic plot. Bottom right: Heatmap with the intra-cluster fraction. We can see how the feature ``Friends'' and the general activity level (represented by the other projection axes pointing in one direction) separate the data}{figure.13}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {14}{\ignorespaces Interactive tour GUI as seen in Figure \ref {fig:aus_preselection}, but after sub-selection of clusters 7 (pink), 8 (grey) and 9 (gold). Top left: 2D tour. Top right: 1D tour. Bottom left: Mosaic plot. Bottom right: Heatmap with the intra-cluster fraction. We can observe the preferences of three new subsegments, which can be interpreted as very active (pink), moderately active (grey) and inactive (gold) tourists traveling without their friends.}}{20}{figure.14}\protected@file@percent }
\newlabel{fig:aus_selection}{{14}{20}{Interactive tour GUI as seen in Figure \ref {fig:aus_preselection}, but after sub-selection of clusters 7 (pink), 8 (grey) and 9 (gold). Top left: 2D tour. Top right: 1D tour. Bottom left: Mosaic plot. Bottom right: Heatmap with the intra-cluster fraction. We can observe the preferences of three new subsegments, which can be interpreted as very active (pink), moderately active (grey) and inactive (gold) tourists traveling without their friends}{figure.14}{}}
\@writefile{lof}{\contentsline {figure}{\numberline {15}{\ignorespaces Final frame of a guided tour of the clustered risk data with the LDA index. The color of the data points indicates which cluster they belong to, with cluster 1 being blue, cluster 2 being orange, cluster 3 being green, cluster 4 being red and cluster 5 being violet. Except for Health, all axes point in one direction, indicating that the projection primarily separates the groups based on their general risk-taking behavior. Most clusters are well separated, but clusters 1 and 4 overlap.}}{21}{figure.15}\protected@file@percent }
\newlabel{fig:final_projection_risk}{{15}{21}{Final frame of a guided tour of the clustered risk data with the LDA index. The color of the data points indicates which cluster they belong to, with cluster 1 being blue, cluster 2 being orange, cluster 3 being green, cluster 4 being red and cluster 5 being violet. Except for Health, all axes point in one direction, indicating that the projection primarily separates the groups based on their general risk-taking behavior. Most clusters are well separated, but clusters 1 and 4 overlap}{figure.15}{}}
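The LDA projection pursuit index driving this guided tour is, following Lee et al. (2005), $I(A) = 1 - |A^\prime W A| / |A^\prime (W + B) A|$, with $W$ and $B$ the within- and between-group sum-of-squares matrices. A sketch (numpy assumed; names are illustrative):
\begin{verbatim}
# LDA projection pursuit index: rewards projections A that
# separate the class means relative to the within-class spread.
import numpy as np

def lda_index(X, labels, A):
    mu = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in np.unique(labels):
        Xg = X[labels == g]
        mg = Xg.mean(axis=0)
        W += (Xg - mg).T @ (Xg - mg)       # within-group scatter
        d = (mg - mu)[:, None]
        B += len(Xg) * (d @ d.T)           # between-group scatter
    num = np.linalg.det(A.T @ W @ A)
    den = np.linalg.det(A.T @ (W + B) @ A)
    return 1.0 - num / den
\end{verbatim}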
\@writefile{lof}{\contentsline {figure}{\numberline {16}{\ignorespaces Interface of the ``Guided Tour - LDA - Regroup'' function. The option highlighted by the yellow box indicates which element of the dropdown menu to select to open the displayed interface. Within the interface, the user can choose how they want the LDA metric to be computed by activating switches. All clusters selected in one column will be considered as a single cluster when computing the LDA metric. This affects only the calculation of the metric and the resulting projection matrices, not the actual selections. If a switch in the ``Ignore'' column is activated, the data points within that cluster will not be considered for the generation of the guided tour.}}{22}{figure.16}\protected@file@percent }
\newlabel{fig:risk-lda-regroup}{{16}{22}{Interface of the ``Guided Tour - LDA - Regroup'' function. The option highlighted by the yellow box indicates which element of the dropdown menu to select to open the displayed interface. Within the interface, the user can choose how they want the LDA metric to be computed by activating switches. All clusters selected in one column will be considered as a single cluster when computing the LDA metric. This affects only the calculation of the metric and the resulting projection matrices, not the actual selections. If a switch in the ``Ignore'' column is activated, the data points within that cluster will not be considered for the generation of the guided tour}{figure.16}{}}
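Conceptually, the regrouping only relabels clusters before the index is computed; a sketch of the idea (not the tool's actual code; the mapping regroup and all names are hypothetical):
\begin{verbatim}
# Merge selected clusters for metric computation and drop ignored ones.
import numpy as np

def regrouped_labels(labels, regroup):
    # regroup maps original cluster ids to metric-time group ids;
    # ids mapped to None are ignored (the "Ignore" column).
    keep = np.array([regroup.get(int(l)) is not None for l in labels])
    merged = np.array([regroup[int(l)] for l in labels[keep]])
    return keep, merged

# e.g. treat clusters 1 and 4 as one group and ignore cluster 3:
labels = np.array([1, 2, 3, 4, 5, 1, 4])
keep, merged = regrouped_labels(labels, {1: 0, 4: 0, 2: 1, 3: None, 5: 2})
# The guided tour index is then evaluated on X[keep] with merged labels.
\end{verbatim}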
\@writefile{lof}{\contentsline {figure}{\numberline {17}{\ignorespaces Final frame of the guided tour generated based on the settings shown in Figure \ref {fig:risk-lda-regroup}. Clusters 1 (blue) and 4 (red) are now much better separated than in the projection shown in Figure \ref {fig:final_projection_risk}. The most influential features for differentiating between the clusters were career risk and recreational risk, with the former being higher for cluster 1 and the latter for cluster 4.}}{23}{figure.17}\protected@file@percent }
\newlabel{fig:regrouped_risk}{{17}{23}{Final frame of the guided tour generated based on the settings shown in Figure \ref {fig:risk-lda-regroup}. Clusters 1 (blue) and 4 (red) are now much better separated than in the projection shown in Figure \ref {fig:final_projection_risk}. The most influential features for differentiating between the clusters were career risk and recreational risk, with the former being higher for cluster 1 and the latter for cluster 4}{figure.17}{}}
\newlabel{discussion}{{5}{23}{}{section.5}{}}
\citation{Laa2020}
\citation{shiny}
\newlabel{conclusion}{{6}{24}{}{section.6}{}}
\bibdata{lionfish_references.bib}
\bibcite{Asimov1985-xr}{{1}{1985}{{Asimov}}{{}}}
\bibcite{shiny}{{2}{2024}{{Chang \emph {et~al.}}}{{Chang, Cheng, Allaire, Sievert, Schloerke, Xie, Allen, McPherson, Dipert, and Borges}}}
\bibcite{ggbeeswarm}{{3}{2023}{{Clarke \emph {et~al.}}}{{Clarke, Sherrill-Mix, and Dawson}}}
\bibcite{cliff2009formative}{{4}{2009}{{Cliff}}{{}}}
\newlabel{resources}{{7}{25}{}{section.7}{}}
\newlabel{acknowledgements}{{8}{25}{}{section.8}{}}
\bibcite{cook_manual_1997}{{5}{1997}{{Cook and Buja}}{{}}}
\bibcite{CBCH94}{{6}{1995}{{Cook \emph {et~al.}}}{{Cook, Buja, Cabrera, and Hurley}}}
\bibcite{ggobi}{{7}{2007}{{Cook and Swayne}}{{}}}
\bibcite{dolnicar2017peer}{{8}{2017}{{Dolnicar}}{{}}}
\bibcite{leisch2018market}{{9}{2018}{{Dolnicar \emph {et~al.}}}{{Dolnicar, Grün, and Leisch}}}
\bibcite{dolnicar2003winter}{{10}{2003}{{Dolnicar and Leisch}}{{}}}
\bibcite{RJ-2023-052}{{11}{2023}{{Hart and Wang}}{{}}}
\bibcite{Hunter:2007}{{12}{2007}{{Hunter}}{{}}}
\bibcite{keim2010mastering}{{13}{2010}{{Keim \emph {et~al.}}}{{Keim, Kohlhammer, Ellis, and Mansmann}}}
\bibcite{laa2023new}{{14}{2023}{{Laa \emph {et~al.}}}{{Laa, Aumann, Cook, and Valencia}}}
\bibcite{Laa2020}{{15}{2020}{{Laa \emph {et~al.}}}{{Laa, Cook, and Valencia}}}
\bibcite{lckl2005}{{16}{2005}{{Lee \emph {et~al.}}}{{Lee, Cook, Klinke, and Lumley}}}
\bibcite{lee2022}{{17}{2022}{{Lee \emph {et~al.}}}{{Lee, Cook, da~Silva, Laa, Spyrison, Wang, and Zhang}}}
\bibcite{flexclust}{{18}{2006}{{Leisch}}{{}}}
\bibcite{lundh1999introduction}{{19}{1999}{{Lundh}}{{}}}
\bibcite{murtagh2014ward}{{20}{2014}{{Murtagh and Legendre}}{{}}}
\bibcite{R}{{21}{2024}{{R Core Team}}{{}}}
\bibcite{schimansky24}{{22}{2024}{{Schimansky}}{{}}}
\bibcite{reticulate}{{23}{2024}{{Ushey \emph {et~al.}}}{{Ushey, Allaire, and Tang}}}
\bibcite{python}{{24}{2009}{{Van~Rossum and Drake}}{{}}}
\bibcite{tourr}{{25}{2011}{{Wickham \emph {et~al.}}}{{Wickham, Cook, Hofmann, and Buja}}}
\gdef \@abspage@last{27}