\documentclass[article]{ajs}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{etoolbox}
\usepackage{amsmath}
\usepackage{graphicx}
\graphicspath{{images/}}
%% almost as usual
\author{Matthias Medl\,\orcidlink{0000-0002-3354-4579}\\ Institute of Statistics \\ BOKU University \\ Vienna \And
Dianne Cook\,\orcidlink{0000-0002-3813-7155}\\ Econometrics and Business Statistics \\ Monash University \\ Melbourne \And
Ursula Laa\,\orcidlink{0000-0002-0249-6439}\\ Institute of Statistics \\ BOKU University \\ Vienna}
%% \AND can be used instead of \And and starts a new line
\title{Demonstrating the Capabilities of the Lionfish Software for Interactive Visualization of Market Segmentation Partitions}
%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Matthias Medl, Dianne Cook, Ursula Laa} %% comma-separated
\Plaintitle{Demonstrating the Capabilities of the Lionfish Software for Interactive Visualization of Market Segmentation Partitions} %% without formatting
\Shorttitle{Exploring Market Segmentation using Lionfish} %% a short title (if necessary)
%% an abstract and keywords
\Abstract{
Market segmentation partitions multivariate data using a clustering algorithm, producing a number of homogeneous groups of consumers for marketing purposes. Often this type of data has no clear cluster structure, that is, no separations or gaps between groups of points, which is why this is considered partitioning rather than clustering. Understanding the differences between groups is typically done by examining single variables. However, this can be confusing, as multiple groups might share similar features on individual variables. The market segmentation partitions that define the groups are actually different linear constraints on the variables. To understand what is special about the characteristics of a group of customers, we should examine linear combinations of variables; differences between groups are likewise discerned by different linear combinations. The \texttt{lionfish} \proglang{R} package facilitates interactive and dynamic exploration of market segmentations. It also allows users to refine groups interactively, focusing on aspects of the data most relevant to their specific analytical objectives. The package integrates tour algorithms through the \texttt{tourr} package, enabling multivariate visualization of high-dimensional data, with \proglang{Python}-powered interactivity allowing manual adjustments and redefinition of boundaries based on visual feedback. In this article, we (i) describe the \texttt{lionfish} software and (ii) demonstrate its utility through three example analyses in the domain of market segmentation. The flexible, user-driven approach provided by \texttt{lionfish} offers deeper insights into complex market behaviors, enabling more effective segmentation and enhancing strategic decision-making.
}
\Keywords{interactive graphics, tourr, exploratory data analysis, \proglang{R}, \proglang{Python}, clustering, market segmentation}
\Plainkeywords{interactive graphics, tourr, exploratory data analysis, R, Python, clustering, market segmentation} %% without formatting
%% at least one keyword must be supplied
%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{50}
%% \Issue{9}
%% \Month{June}
%% \Year{2012}
%% \Submitdate{2012-06-04}
%% \Acceptdate{2012-06-04}
%% \setcounter{page}{1}
\Pages{1--xx}
%% The address of (at least) one author should be given
%% in the following format:
\Address{
Matthias Medl\\
Institute of Statistics\\
BOKU University Vienna\\
E-mail: \email{matthias.medl@boku.ac.at}\\
}
%% It is also possible to add a telephone and fax number
%% before the e-mail in the following format:
%% Telephone: +43/512/507-7103
%% Fax: +43/512/507-2851
%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}
%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\section{Introduction}
Clustering algorithms are often used to make large and complex datasets more digestible. For market segmentation, clustering algorithms are used to partition observations into a small number of groups, by incorporating associations between the variables. Market segmentation supports targeted approaches to different groups of customers based on common traits. Clustering provides a data-driven solution to partitioning customer data into marketing segments. \cite{leisch2018market} provides an extensive overview of using clustering algorithms for market segmentation purposes.
The difference between cluster analysis and partitioning typically lies in the nature of the data. With cluster analysis, we usually envision data that contains separated clusters, and a successful clustering result is one that divides the data based on these gaps. With partitioning, the role of the clustering algorithm is to propose smaller homogeneous subsets, whether gaps are present or not, with potential for different marketing approaches, as done with market segmentation. The shape of the data will affect how it is partitioned. Figure \ref{kmeans-partition} illustrates the different ways that clustering might partition a 2D data set into four groups when the correlation between the two features varies. When the correlation is high (Figure \ref{kmeans-partition}C), the partitioning will be along the combination of features that produces the highest variance. With lower correlation (Figure \ref{kmeans-partition}B), it will segment the bottom and top, and divide the middle into two parts in the opposite direction. When there is no association, the partitioning is radial, like a windmill (Figure \ref{kmeans-partition}A).
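This behaviour is easy to reproduce with a short simulation. The following sketch uses plain NumPy and a textbook Lloyd's-algorithm $k$-means; the data-generating parameters are illustrative and are not the ones used to produce the figures.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest centre, update centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every centre, then nearest-centre labels
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres

# 2D Gaussian data with high correlation between the two features (as in case C)
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=500)
labels, centres = kmeans(X, k=4)
```

Plotting \texttt{X} coloured by \texttt{labels} for correlations of 0, 0.5, and 0.9 reproduces the three partitioning patterns described above.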
\begin{figure*}[ht]
\centerline{\includegraphics[width=1\textwidth]{images/intro1.pdf}}
\caption{The different ways we might expect a clustering algorithm to partition 2D data with different association structure. If the correlation is high (C), the partitioning happens along the primary direction of the association.}
\label{kmeans-partition}
\end{figure*}
From the plots of the full 2D data, in each data set, we can see that the data is divided into four parts. Typically though, the approach is to plot the groups on a single variable, as done in Figure \ref{kmeans-histogram}. The histograms of the two features, V1 and V2, show some relationships between the four groups. For C, the red cluster has low values on both variables and the green cluster has high values on both. Looking more closely, the orange cluster has moderately low values on both and the blue cluster moderately high values on both. We could infer from this that the partitioning is being done along an equal combination of V1 and V2, but {\em we cannot see the partition}. %The distinct borders between the partitions can only be seen from the scatterplots of both features. With data sets A and B, the logic for how the data is partitioned is harder still.
When there are more than two features, histograms of the individual features are still commonly used to display the partitioning results. This means that the analyst likely cannot understand how the partitioning divides the data. All they can observe is roughly how the individual features relate to the partition, which is useful but inadequate. To understand how the partitions are formed in the data, we need to view linear combinations of variables and, often, focus on a subset of groups. Figure \ref{kmeans-linear-combinations} illustrates this approach for the 2D case, where it is not strictly necessary, but where it helps to build a mental model of how this operates in higher dimensions. The partitions are cuts of the data, and so are best viewed as jittered dotplots (using the \texttt{ggbeeswarm} package \citep{ggbeeswarm}). For data A we plot V1, where the partitioning of the green and blue groups happens. Note that the orange and red partitions are faded in the plot: there is no single linear view in which all partitions can be seen, so one needs to subset the data to focus on two groups only. Compare this with the full data plot in Figure \ref{kmeans-partition}. The partition between the red and orange groups could be seen if we plotted V2 and subsetted the data to these two groups. Similarly, the partitions between other pairs would be examined. For data B, a linear combination that is roughly a contrast of V1 and V2 shows the distinction between the blue and green partitions, and because there is no single linear combination revealing all partitions, we would choose a different linear combination to examine the partitions between other pairs of groups. In data C, all four partitions can be seen when a roughly equal linear combination of V1 and V2 is used; here this is computed using the first principal component.
\begin{figure*}[!t]
\centerline{\includegraphics[width=1\textwidth]{images/intro2.pdf}}
\caption{The typical approach to understanding the data partitions is plotting individual features using histograms, where the group is mapped to colour. Here, the three data sets displayed in Figure 1 are shown in the columns, and the rows correspond to the two variables, V1, V2. Differences between groups can be seen, and interpreted on a single variable level, such as the red group has low values on both V1 and V2 in data set C. However, how the data is divided cannot be seen.}
\label{kmeans-histogram}
\end{figure*}
\begin{figure*}[!t]
\centerline{\includegraphics[width=1\textwidth]{images/intro3.pdf}}
\caption{The way to examine partitions in high dimensions is to examine linear combinations, for each of the data sets. The three plots show the views of the three data sets (A, B, C) revealing the partitioning. For data sets A and B, the red and orange groups have been faded out in order to focus on the blue and green groups, because there is no single linear combination where partitions between all groups are visible. All four group partitions can be seen for data C, because there is a linear combination where these partitions are all distinctly splitting the groups.}
\label{kmeans-linear-combinations}
\end{figure*}
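The principal-component view used for data C can be computed directly. A minimal NumPy sketch (the simulated data here is illustrative, not the data behind the figures):

```python
import numpy as np

rng = np.random.default_rng(2)
# highly correlated 2D data, as in data set C
X = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=300)
Xc = X - X.mean(axis=0)                       # centre the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                                   # first principal direction (unit vector)
scores = Xc @ pc1                             # the 1D linear combination for the dotplot
```

With strong positive correlation, \texttt{pc1} has roughly equal loadings on V1 and V2, which is the "equal combination" view in which all four partitions of data C are visible.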
With data having more than two variables, tour methods for viewing high dimensions can be used for this visualization. A tour (\cite{Asimov1985-xr}; see also \cite{lee2022}) shows scatterplots of linear combinations of features and thus provides views like those in Figures \ref{kmeans-partition} and \ref{kmeans-linear-combinations}, where distinct differences between partitions can be observed. High dimensions are still tricky, and a combination of animation of the linear combinations and interactive control \citep{cook_manual_1997, laa2023new} over the combinations is important. A scatterplot of a combination of features can be considered a projection of the data, and thus, like a shadow of a 3D object, some aspects of the data (object) can be obscured. Using slices of the projected data \citep{Laa2020} can be a useful addition to projections. This paper illustrates how to use these methods to better understand partitioning results for multivariate data.
This paper is organised as follows. Section \ref{interface} describes the software and its attributes. The user workflow is explained in Section \ref{workflow}. The market segmentation case studies, using tourism datasets provided in \cite{leisch2018market}, can be found in Section \ref{applications}. Sections \ref{discussion} and \ref{conclusion} discuss the limitations and potential future developments of \texttt{lionfish}.
\section{Interactive interface with lionfish} \label{interface}
One of the aims of this work is to build an interface that allows for interactive exploration of clustered multivariate data. In some combination of the features, one should be able to see the separations between the clusters as illustrated in the 2D example in Figure \ref{kmeans-partition}.
While tour animations are best obtained within \proglang{R} using the \texttt{tourr} package~\citep{tourr}, it does not enable the interactivity required, for example, for a manual tour~\citep{laa2023new}. Interactive graphics are available when using \proglang{Javascript}, as implemented in the \texttt{detourr} package \citep{RJ-2023-052}. This allows replaying a recorded tour path with interactive graphics, and can also be linked with additional displays, but lacks capabilities for manual tours. The objectives for the interface are to enable:
\begin{enumerate}
\item Visualizing the partitions between clusters in combinations of features using tour methods, with both a grand tour and manual tour.
\item Interactively selecting data points (brushing) to refine the cluster solution, as done with spin-and-brush tools~\citep{ggobi}.
\item Linking multiple displays to focus on particular clusters, and simplify the problem to better understand the clustering solution.
\item Updating displays based on user selections, such as feature selection, cluster selection, or re-scaling.
\end{enumerate}
To integrate user interactions with the capabilities of the \texttt{tourr} package, active communication with the interface is required. For example, we may wish to explore the local neighbourhood of a projection selected by the user with a local tour animation provided by \texttt{tourr}, or we may want to optimize a guided tour path using groups identified via brushing.
Our solution uses \proglang{Python} for high-performance interactivity, through the packages \texttt{TKinter} \citep{lundh1999introduction}, \texttt{CustomTKinter} \citep{schimansky24}, and \texttt{matplotlib} \citep{Hunter:2007}, with integration to the \texttt{tourr} package \citep{tourr} via \texttt{reticulate} \citep{reticulate}, a framework that facilitates seamless interoperability between \proglang{Python} and \proglang{R}.
The interface was implemented in the \proglang{R} package \texttt{lionfish} and offers a variety of linked interactive plot types, providing users with the flexibility to visualize their data from multiple perspectives. The ability to navigate through various projections of the displayed tours directly within a graphical user interface (GUI) enables users to explore different aspects of the dataset. Furthermore, users can initiate new tours directly from the interface. The GUI also supports interactive feature selection, allowing users to specify which subset of features should be visualized in the plots. Once users have identified interesting views or settings, \texttt{lionfish} allows them to save the displayed projections, subsets, and plots. This functionality ensures that analysis states can be preserved for further examination or reporting, making the package particularly useful for iterative analysis where findings may need to be revisited or shared with collaborators.
With its high level of interactivity, performance, and ease of use, \texttt{lionfish} streamlines the exploration of complex datasets, offering a powerful tool for researchers working with high-dimensional data.
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{GUI_overview.pdf}
\caption{Overview of the GUI. A: Sidebar controls; B: Display area; C: Feature selection checkboxes; D: Subset selection with color indicators; E: Frame selection interface; F: Interfaces for adjusting the number of bins of histograms, animating tours and blending out projection axes with a low norm; G: Save and load buttons; H: Interface for starting new tours; I: Metric selection interface.}
\label{fig:GUI_overview}
\end{figure}
\subsection{Overview of the graphical user interface (GUI)}
The \texttt{lionfish} GUI can be launched using the function \texttt{interactive\textunderscore tour()}. At a minimum, the user needs to provide both the dataset and the instructions for constructing the desired plots. The dataset can be supplied as a data matrix, data frame, or data table. Categorical variables can be loaded; however, they will be transformed to numeric values that reflect the categories. The plotting instructions should be passed as a list containing the named elements \texttt{type} and \texttt{obj}. The \texttt{type} element specifies the type of display to generate, such as \texttt{scatter} for a scatterplot or \texttt{2d\textunderscore tour} for a 2D tour. For a comprehensive list of the currently implemented \texttt{type} elements please visit the software's website. The \texttt{obj} element further defines the properties of the chosen display. For example, to create a 2D tour, the user must provide a \texttt{tour\textunderscore history} object, which can be generated using the \texttt{tourr} package. For a scatterplot, the user needs to provide a character vector specifying the names of the features to be displayed. With the optional arguments of \texttt{interactive\textunderscore tour()} the user can specify the feature names, the arrangement of the plots within the GUI, predefined subsets of the data (e.g., cluster solutions), custom names for these subsets, the number of available subsets, and the size of each plot.
The GUI is divided into two main sections: a sidebar on the left, which contains a comprehensive set of interactive controls (Figure \ref{fig:GUI_overview}A), and the display area on the right (Figure \ref{fig:GUI_overview}B), where the selected plots are shown.
At the top of the sidebar, users can select and deselect features using checkboxes (Figure \ref{fig:GUI_overview}C), thereby controlling which features are displayed in the plots. Below this, a list of the currently loaded subsets of the data is displayed. The subsets can be defined by the user when launching the GUI. In the context of the market segmentation use cases discussed in Section \ref{applications}, these subsets are clusters. However, depending on the specific application, labels from other sources may be used, which is why they are referred to as subselections in this section. Each subset has its own checkbox to designate the active subset (Figure \ref{fig:GUI_overview}D). When data points are manually selected in the plots, they are assigned to the active subset and colored accordingly. For scatterplots, data points can be selected by encircling them directly on the plot while holding down the left mouse button. For barplots, clicking on a specific bar selects the data represented by that bar. The colored boxes next to the subset names indicate the colors assigned to the data points. Clicking on these boxes adjusts the transparency of the points, which is helpful for highlighting and comparing subsets. Subsets can also be renamed using the provided text boxes. The \texttt{Reset original selection} button allows users to revert the subset selections to their initial state. All plots displayed in the GUI are linked, meaning that changes in one plot will affect the other plots. For instance, if observations are moved from one subset into another in one plot, the reassignment will also occur in the other plots.
The frame selection interface (Figure \ref{fig:GUI_overview}E) displays the frames of the currently shown tours and enables users to jump between frames of the tour objects by specifying the frame of the projection to be displayed and pressing the \texttt{Update frames} button. One can also move through the projections of a tour by pressing the arrow keys.
Below this, there are three additional interfaces (Figure \ref{fig:GUI_overview}F). The ``Number of bins of histograms'' control can be used to adjust the number of bins in histograms; more bins result in higher resolution but slower display updates. The ``Animate'' element allows users to animate tours, automatically shifting to the next projection after a specified amount of time. The ``Blend out projection threshold'' offers the ability to hide projection axes with a norm smaller than a chosen threshold. This can be helpful because projection axes with a small norm have little influence on the shown projection, and displaying all of them can be distracting.
Users can also save and load projections and subsets using the respective buttons (Figure \ref{fig:GUI_overview}G). The \texttt{Save projections and subsets} button allows users to preserve the current state of their analysis. This includes not only the displayed projections and active tours, but also various minor settings, such as the highlighted subset or the value of the ``Number of bins of histograms'' control. These states can be recovered by pressing the \texttt{Load projections and subsets} button, which spawns a file browser with which one can select a folder containing previously saved files.
New tours can be started directly from within the GUI (Figure \ref{fig:GUI_overview}H). The options for new tour paths are a local tour around the currently shown projection, and guided tours that search for projections based on the holes index or the linear discriminant analysis (LDA) index. The holes index is sensitive to projections with few points in the center, while the LDA index aims to maximize the distance between the centers of the selected subsets \citep{ggobi}. Additionally, all tour displays allow users to perform a manual tour: by right-clicking and dragging the arrowheads of projection axes, the projection is recalculated accordingly. This allows for manual exploration of the data.
At the bottom of the interactive controls, users can select different metrics for some plot types, e.g., heatmaps (Figure \ref{fig:GUI_overview}I).
\subsection{R/Python interface}
The majority of \texttt{lionfish} was written in \proglang{Python} \citep{python}. The \proglang{R} \citep{R} side of the package handles setting up the \proglang{Python} environment where the interactive interface is being run, launching the interactive tour, and generating new tours when initiated through the interface. To incorporate the functionality of the \texttt{tourr} package without translating large portions of its code from \proglang{R} to \proglang{Python}, the \texttt{reticulate} package was used. This approach allows \texttt{lionfish} to automatically benefit from updates to \texttt{tourr} and avoids the necessity to reimplement code in \proglang{Python} that was already written in \proglang{R}.
To minimize inefficiencies associated with cross-language communication and to simplify debugging, \texttt{tourr} functions were accessed only when necessary and otherwise implemented directly in \proglang{Python}. For instance, the orthonormalization of projection axes via the Gram-Schmidt process, linear transformations of the data with the projection matrices, and scaling of the data with the half-range parameter for visualization purposes were implemented in \proglang{Python}. Implementing these functions directly in \texttt{lionfish} improves its stability, since future updates of \texttt{tourr} or \texttt{reticulate} could otherwise affect the proper interaction between the packages.
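As an illustration of the kind of helper re-implemented on the \proglang{Python} side, the following sketch shows Gram-Schmidt orthonormalization of a projection matrix, the linear transformation of the data, and half-range scaling. The function names, signatures, and the choice of half-range here are illustrative, not \texttt{lionfish}'s actual internal API.

```python
import numpy as np

def orthonormalise(proj):
    """Gram-Schmidt process on the columns of a p x d projection matrix."""
    proj = proj.astype(float).copy()
    for j in range(proj.shape[1]):
        for i in range(j):
            # remove the component of column j that lies along column i
            proj[:, j] -= (proj[:, i] @ proj[:, j]) * proj[:, i]
        proj[:, j] /= np.linalg.norm(proj[:, j])
    return proj

def project(data, proj, half_range):
    """Linearly project the data, then scale by the half-range for display."""
    return (data @ orthonormalise(proj)) / half_range

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # 100 observations, 5 features
P = orthonormalise(rng.normal(size=(5, 2)))      # random 2D projection
Y = project(X, P, half_range=np.max(np.abs(X)))  # illustrative half-range choice
```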
\subsection{Structure of the Python code}
The central component of the \proglang{Python} code is the \texttt{CustomTKinter} class \texttt{InteractiveTourInterface}. This class centrally stores attributes related to all plots, such as the dataset, sub-selections, feature selections, and other shared information. Plot-specific data is organized in dictionaries (the \proglang{Python} equivalent of named lists in \proglang{R}), including the display type, construction instructions, tour projections (only in the case of tour displays), color schemes for the displayed data, and, where applicable, the selector and manual projection manipulation classes.
The selector classes handle the behavior when users manually select data points to move them to the active subset. After a selection is made, the selector class updates the centrally stored sub-selection attribute and ensures all other displays reflect these changes. The manual projection manipulation classes construct the arrows representing the projection axes in the displays and manage the manual adjustment of projections, thus enabling manual tours. Users can right-click and drag the arrowheads to modify the projections, after which the class orthonormalizes the projection axes and updates both the projection and the transformed data accordingly.
\subsection{Fast-drawing enhancements}
The implementation of bit-blitting was crucial to ensuring fast plot updates and a smooth user experience. With bit-blitting, the static elements of the display, such as the outer frames of the plots, are stored as a background image. When a plot is manipulated, only the affected plot is updated, and within it, only the interactive elements, such as the data points and projection axes in a 2D tour, are rendered on top of the background image.
In practice, this means that the background, without the interactive elements, must be captured either during initialization or after major updates. The entire plot is first rendered without the interactive elements, the background is then saved, and finally, the plot is redrawn with the interactive elements blended in. Since this process is relatively slow, full updates are only triggered during initialization or after significant changes, such as modifications to the set of active features.
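This cycle maps directly onto \texttt{matplotlib}'s blitting API. The following is a minimal, self-contained sketch of the technique (using the non-interactive Agg backend here; \texttt{lionfish} itself draws on a Tk canvas), not \texttt{lionfish}'s actual code.

```python
import matplotlib
matplotlib.use("Agg")          # headless backend; the GUI would use a Tk canvas
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.set(xlim=(-1, 1), ylim=(-1, 1))
# animated=True excludes the interactive artist from normal full redraws
pts, = ax.plot(np.random.rand(50) - 0.5, np.random.rand(50) - 0.5,
               "o", animated=True)
fig.canvas.draw()                                # render the static background once
background = fig.canvas.copy_from_bbox(ax.bbox)  # cache it

def fast_update(x, y):
    """Redraw only the animated artist on top of the cached background."""
    pts.set_data(x, y)
    fig.canvas.restore_region(background)
    ax.draw_artist(pts)
    fig.canvas.blit(ax.bbox)

fast_update(np.random.rand(50) - 0.5, np.random.rand(50) - 0.5)
```

After a major change (e.g., a different feature set), the background must be recaptured by repeating the draw-and-copy step.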
\section{Workflow with the lionfish package} \label{workflow}
While \texttt{lionfish} can be used with data subsets of any origin, or even without any subsetting at all, we will focus on analyzing cluster solutions in this section. Before launching the \texttt{lionfish} interface, the user should have performed a partitioning of their choice, and provide the initial clustering solution to the interface. They can then launch the interface to explore and potentially refine this solution.
\subsection{Feature and cluster relationships}
A first step is to assess which features are important for the cluster solution. The interface provides different capabilities that support feature selection: tour view and summary heatmaps.
The 1D and 2D tour views can be used to understand the sensitivity of the grouping to individual features or interactions thereof. Examples of these display types can be seen in Figure \ref{fig:GUI_overview}, where a 2D tour projection is shown on the top left and a 1D tour projection on the bottom left. The arrows shown in the plots indicate the loadings of the linear combination given by the projection vector (1D projection) or matrix (2D projection). Short arrows indicate that the loading of a given feature is small, and thus that the influence of that feature on the projection is small. Conversely, long arrows indicate a large influence on the projected data. Furthermore, we can identify relationships between the loadings and the data. Looking at Figure \ref{fig:GUI_overview}, we can see in both projections that the blue observations lie in the direction that the ``alpine skiing'' axis points. We can infer from this that the blue observations generally had a large positive value of ``alpine skiing''. By clicking and dragging an arrowhead with the right mouse button, we can interactively change the loadings of the displayed projections. This functionality is referred to as a manual tour. It can be used to change the influence of the features interactively by turning the data in multidimensional space and projecting it into one or two dimensions. This is useful for examining the relationship between the features and the partition between clusters. The ideal starting view is the one obtained through a projection pursuit guided tour~\citep{CBCH94} optimized for the separation between the labeled groups~\citep{lckl2005}.
The summary heatmaps provide an overview of the cluster compositions relative to the features. An example can be seen in Figure \ref{fig:GUI_overview} in the top right. Consider the matrix,
\[
C = \left[ \begin{array}{cccc} c_{11} & c_{12} & \dots & c_{1p} \\
c_{21} & c_{22} & \dots & c_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
c_{k1} & c_{k2} & \dots & c_{kp}
\end{array} \right]
\]
where $c_{ij}$, $i=1, \dots, k$ (number of clusters), $j=1, \dots, p$ (number of features), is a summary statistic of feature $j$ in cluster $i$. Because our examples use binary features, $c_{ij}$ is the number of 1s of feature $j$ in cluster $i$.
There are several ways that these values can be normalized to examine different aspects: $f_{ij}^{o} = \frac{c_{ij}}{n}$, $f_{ij}^{c} = \frac{c_{ij}}{n_{i}}$ and $f_{ij}^{f} = \frac{c_{ij}}{n_{j}}$, where $n$ is the total number of observations, $n_i$ is the number of observations in cluster $i$, and $n_j$ is the column total of $C$, i.e., the total count of feature $j$. The first, $f^{o}$, is the overall fraction, where counts are normalized by the overall number of observations $n$. It gives a quick overview of the overall magnitude of the features.
The second, $f^c$, is normalized relative to the size of each cluster and can be considered the distribution of features in each cluster. It is useful for comparing the composition of each cluster relative to the features. For example, for $p=4$, if $f^c_{1j} = (0.9, 0.2, 0.1, 0.1)$, this suggests that high values (1s) of feature 1 distinguish cluster 1, and that values of the other features were low. In the context of the later examples, this would mean cluster 1 contains tourists that especially engaged in activity 1, but not in activities 2, 3 and 4. This normalization produces what is called the {\bf intra-cluster fraction}. In the case of binary features, it equates to the cluster means.
The last, $f^f$, is normalized relative to each feature (column sum of $C$), and can be considered the distribution of clusters on each feature. It is useful for examining how features are related to a cluster. For example, $f^f_{i1} = (0.2, 0.7, 0, 0.1, 0)$ (for $k=5$) would indicate that high values on feature 1 are primarily found in cluster 2. In the context of the later examples, this would mean activity 1 is most commonly listed in cluster 2. This normalization produces what is called the {\bf intra-feature fraction}. Note that this metric is heavily influenced by the cluster sizes: for a feature that is uniformly distributed across all clusters, the largest cluster receives the highest intra-feature fraction.
In this example, while the intra-cluster fraction for cluster 1 and feature 1 is notably high at 0.9, the intra-feature fraction is comparatively low at 0.2. Applied to the subsequent analysis, this indicates that 90\% of tourists in cluster 1 participated in activity 1, yet only 20\% of all tourists who engaged in activity 1 were part of cluster 1. The majority of individuals participating in activity 1 were members of cluster 2. This discrepancy may be attributed to cluster 1 representing a relatively small group of tourists, characterized by a pronounced preference for activity 1 and a lack of interest in other activities.
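As a concrete illustration of the three normalizations, the following sketch (on a small synthetic binary dataset, not the survey data) computes $f^o$, $f^c$ and $f^f$ directly from a data matrix and cluster labels:

```python
import numpy as np

# Toy binary data: 10 observations, 3 features, 2 clusters (labels 0/1).
X = np.array([[1,0,1],[1,0,0],[1,1,0],[1,0,1],[0,1,0],
              [0,1,1],[0,1,0],[1,1,0],[0,1,1],[0,1,0]])
labels = np.array([0,0,0,0,1,1,1,1,1,1])

k, n = 2, X.shape[0]
# Count matrix C: rows = clusters, columns = features.
C = np.array([X[labels == i].sum(axis=0) for i in range(k)], dtype=float)
n_i = np.bincount(labels).astype(float)   # cluster sizes
n_j = C.sum(axis=0)                       # per-feature totals (column sums of C)

f_overall = C / n                 # f^o: overall fraction
f_cluster = C / n_i[:, None]      # f^c: intra-cluster fraction (= cluster means)
f_feature = C / n_j[None, :]      # f^f: intra-feature fraction
```

Note that each column of the intra-feature fraction sums to one, and each row of the intra-cluster fraction reproduces the feature means of that cluster.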
The metrics currently implemented in the heatmap interface, as described in this section, are designed for binary data. To expand the heatmap interface to accommodate ordered categories or numerical scores, one could indicate the cluster and feature means on the edges of the heatmap, with the cluster-specific feature means displayed within each element. This, however, has not been implemented yet.
\subsection{Subset selection}
The spin-and-brush approach suggests clustering data manually when using a tour: we run a tour animation, stop when we see a group of points that differs from the rest of the distribution, brush it, and then continue. Different projections will enable the separation of different groups, and for well-separated clusters we will be able to recover a full cluster solution in this manner.
A similar approach can be used to refine a partitioning solution. In a \textit{visual analytics} approach we use interactive visualizations to integrate human judgment with statistical and machine learning models~\citep{keim2010mastering} to optimize knowledge extraction from data.
Here this is particularly useful for integrating prior knowledge or business interests into a given cluster solution. In the interface we can keep the provided clustering but separate out new subsets via manual selection, for example after finding a group of particular interest via a manual tour.
\subsection{Reproducibility}
Ensuring the reproducibility of data analysis is a fundamental principle in scientific research. It allows others to verify the validity of the findings and is key to the integrity of the scientific process. Reproducibility not only builds trust in the research outcomes but also enables the scientific community to build upon existing work. When analyses can be replicated, it can be validated whether the conclusions drawn from the data are robust and not dependent on the specific conditions or idiosyncrasies of the original analyst. Moreover, reproducible research can serve as a foundational building block for subsequent studies, fostering incremental advancements in knowledge.
One challenge in the context of interactive data analysis is that not all steps of the analysis are precisely documented in the form of code, especially when using graphical user interfaces (GUIs) where user-driven interactions might not leave a traceable history. This lack of documentation can hinder the ability of others to reproduce the analysis or to understand how specific results were obtained. To mitigate this challenge, it is essential to implement mechanisms that allow users to easily save and share intermediate snapshots of their analyses.
\texttt{lionfish} addresses this by making it easy and accessible to save intermediate snapshots of the analysis. Specifically, the \texttt{Save projections and subsets} button enables users to take snapshots of their analysis, including visual representations, selected data, and parameter settings. Upon pressing this button, the user can first select which of the following files they want to save:
\begin{itemize}
\item A \texttt{.png} file containing the currently displayed graphics,
\item \texttt{.csv} files that capture the feature and subset selection as well as projections of the tours displayed at the time of the snapshot,
\item two \texttt{.pkl} files that contain state features of the GUI, allowing for complete recovery of the snapshot.
\end{itemize}
Then a file browser is triggered, allowing users to specify the destination for saving their snapshot. The saved files provide dual utility. First, they allow users to fully recover the state of the analysis within the GUI (which requires saving the \texttt{.csv} and \texttt{.pkl} files). This can be achieved either by using the \texttt{Load projections and subsets} button, or by launching a new GUI instance with the \texttt{load\_interactive\_tour()} function. The latter approach, using \texttt{load\_interactive\_tour()}, has the added flexibility of only requiring the original dataset and the directory containing the saved files. This function also allows users to modify display settings, such as adjusting the size of the interactive plots or changing the arrangement of the display grid. In contrast, when loading the saved state directly from within the GUI, it is crucial that the active session was initiated with the same dataset and plot objects that were present at the time of saving. This ensures that the analysis environment is accurately replicated.
Second, the saved \texttt{.csv} files provide a way to inspect and further analyze the data outside of the original interface. This opens up opportunities for deeper analysis and extensions of the work.
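For instance, a saved subset file can be post-processed with standard tools. In the sketch below the column layout of the \texttt{.csv} is hypothetical and only meant to illustrate the idea; the actual files written by \texttt{lionfish} may be structured differently:

```python
import csv
import io
from collections import Counter

# Hypothetical contents of a saved subsets .csv (one cluster label per
# observation); the real column layout written by lionfish may differ.
saved = "observation,cluster\n0,1\n1,1\n2,2\n3,3\n4,2\n5,2\n"

rows = list(csv.DictReader(io.StringIO(saved)))
sizes = Counter(r["cluster"] for r in rows)   # cluster sizes outside the GUI
```

In a real analysis one would read the saved file from disk instead of a string, and could then merge the labels back onto the original dataset for further modeling.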
This level of interactivity and documentation is crucial for reproducibility, as it ensures that even exploratory, interactive data analysis can be retraced and validated by others. Ultimately, these features facilitate a reproducible workflow that balances the flexibility of interactive exploration with the rigor of reproducible research.
\section{Applications}\label{applications}
In this section, the Austrian Vacation Activities \citep{dolnicar2003winter}, Australian Vacation Activities \citep{cliff2009formative} and Tourist Risk-taking \citep{dolnicar2017peer} datasets are analyzed with \texttt{lionfish} to illustrate the software's capabilities.
\subsection{Austrian Vacation Activities dataset}
The Austrian Vacation Activities dataset comprises responses from 2,961 adult tourists who spent their holiday in Austria during the 1997/98 season. Participants were asked to evaluate the importance of 27 different activities during their vacation. The survey categorized responses based on four levels of importance: ``totally important'', ``mostly important'', ``a bit important'', and ``not important''. The original authors binarized the responses for their analysis: a value of 1 was assigned if the activity was rated as ``totally important'', and a value of 0 if any of the other categories were selected. The survey was conducted by the Europäisches Tourismus Institut GmbH at the University of Trier and focused exclusively on tourists who did not stay in the country's capital cities.
When working with projections involving binary data, distinct groupings emerge when two features dominate the projection. Figure \ref{fig:winter-gt} illustrates this with a scatterplot of alpine skiing and cross-country skiing (Figure \ref{fig:winter-gt}A), alongside three projections from a grand tour of the data (Figures \ref{fig:winter-gt}B-D). In these projections, the influence of alpine skiing and cross-country skiing decreases progressively from B to D. When only two binary features are plotted (Figure \ref{fig:winter-gt}A), only four distinct points are visible. As additional features begin to influence the projections, the data becomes more diffuse, with the final projection (Figure \ref{fig:winter-gt}D) starting to appear continuous.
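This effect can be checked numerically. The sketch below (with simulated binary data, not the survey responses) counts the distinct points in a 1D projection as the number of contributing features grows:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 27)).astype(float)  # simulated binary data

def n_distinct(a):
    """Number of distinct points in the 1D projection X @ a (a normalized)."""
    a = a / np.linalg.norm(a)
    return np.unique(np.round(X @ a, 8)).size

# Only two features with nonzero (unequal) loadings: at most 2^2 = 4 points.
a2 = np.zeros(27)
a2[0], a2[1] = 1.0, 0.5

# Random loadings on all 27 features: the projection looks almost continuous,
# since nearly every distinct binary row maps to a distinct projected value.
a27 = rng.normal(size=27)
```

With two active features the projection collapses onto at most four values, while generic loadings over all 27 features spread the 500 observations over hundreds of distinct positions.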
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{winter-gt.pdf}
\caption{Scatterplot of the features alpine skiing and cross-country skiing (A) and three 2D projections from a grand tour on the Austrian Vacation Activities data (B-D). Because the 27 features are binary, distinct groupings appear when few features dominate the projection (A-B), but with this many binary features the data looks mostly continuous as the loadings of the other features increase (C-D).}
\label{fig:winter-gt}
\end{figure}
To gain further insight into the dataset, a $k$-means clustering as described in \cite{leisch2018market} was performed, using the function \texttt{stepcclust} from the \proglang{R} package \texttt{flexclust} \citep{flexclust} with $k=6$ and \texttt{nrep = 20}.
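For readers without the \proglang{R} toolchain, the restart strategy of \texttt{stepcclust} can be approximated in Python. The following is a minimal Lloyd's-algorithm sketch with random restarts on simulated binary data, not the \texttt{flexclust} implementation:

```python
import numpy as np

def kmeans(X, k, nrep=20, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm with nrep random restarts (keeping the
    best-inertia run), mirroring the restart strategy of stepcclust."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(nrep):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # squared Euclidean distance of every point to every center
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        inertia = d.min(axis=1).sum()
        if inertia < best_inertia:
            best_inertia, best_labels = inertia, labels
    return best_labels, best_inertia

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(300, 27)).astype(float)  # simulated binary data
labels, inertia = kmeans(X, k=6)
```

Running several restarts and keeping the lowest-inertia solution reduces the chance of ending in a poor local optimum, which matters for data without well-separated clusters.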
\subsubsection{Feature selection}
The original dataset contains $p=27$ features. Some of these features are more informative than others, so it may be beneficial to drop some of them: although this results in a loss of information, projections can become difficult to interpret if too many variables are plotted. Also, interacting with the projection axes in the \texttt{lionfish} GUI can become cumbersome when handling more than \textasciitilde 15 features at once. To counteract this, one can use the feature selection capabilities of \texttt{lionfish}, which allow for quick on-the-fly removal and addition of features. An effective and intuitive way to perform feature selection is to use the heatmap display within \texttt{lionfish}.
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{winteractiv_heatmap.pdf}
\caption{Traditional overview of clusters. Color represents (A) the intra-cluster fraction, and (B) the intra-feature fraction. From A, we can see that cluster 3 tourists like alpine skiing, going out in the evening and going to discos and bars. They also like relaxing, shopping and sightseeing but these are popular among all tourists. From B, we can see the distribution of activities across clusters, e.g. most tourists who use health facilities are found in cluster 4 while tourists going to a pool or sauna are primarily found in clusters 1 and 5.}
\label{fig:winteractiv_heatmap}
\end{figure}
In Figure \ref{fig:winteractiv_heatmap}A, where colors show the intra-cluster fraction, we can observe the general interests of tourists within each cluster. Some activities are high in almost all clusters (e.g. relaxing, shopping) and some are low in all clusters (e.g. ski touring and horseback riding); ignoring these can be helpful when assessing the distribution of activities between clusters. Comparing clusters 5 and 6, we can see that alpine skiing is high in both, but cluster 5 tourists also like going to the pool or sauna, while cluster 6 tourists prefer going for walks.
In Figure \ref{fig:winteractiv_heatmap}B, we can determine whether tourists selecting a particular feature are equally distributed or if they primarily fall within one or a few clusters. For example, nearly all tourists who visited museums are grouped in cluster 2, and those who used health facilities are primarily attributed to cluster 4. Some activities, such as relaxing, are popular across all clusters.
These heatmaps can help with selecting features to focus on using the tour. Unpopular and universally popular activities can be removed. After performing the feature selection by unchecking the corresponding checkboxes in the GUI using this strategy, the following 12 activities remained: alpine skiing, going to a spa, using health facilities, hiking, going for walks, excursions, going out in the evening, going to discos/bars, shopping, sightseeing, museums, and pool/sauna.
We can now repeat the $k$-means clustering with \texttt{stepcclust} on the reduced dataset. Silhouette plots of both cluster solutions can be seen in Figure \ref{fig:silhouette_comparison}. By comparing both silhouette plots, we can see that the cluster solution with the reduced dataset results in a clustering of higher quality. Thus, we will continue with the analysis on the reduced dataset with the corresponding cluster solution. We can also see in Figure \ref{fig:silhouette_comparison}B that cluster 3 is of comparatively low quality. It is important to note that the silhouette scores were generally quite low, reflecting the lack of clearly separable clusters in the data.
This is common in market segmentation analysis, where distinct clusters are often not clearly separable. As a result, clustering and feature selection algorithms may converge to a local optimum, and even if the global optimum is found, it may not necessarily be useful. For instance, a feature selection algorithm might drop features to find a better optimum according to an objective function. However, this can be problematic if an analyst is interested in the influence of the dropped features. Consequently, a purely data-driven analysis can become counterproductive. The primary aim of this analysis is to gain insights into the data and understand the underlying patterns, rather than to identify an objectively optimal clustering configuration.
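Silhouette scores like those compared in this section can be computed from pairwise distances alone. A self-contained sketch (on synthetic, well-separated data rather than the survey data) is:

```python
import numpy as np

def silhouette_scores(X, labels):
    """Silhouette s(i) = (b_i - a_i) / max(a_i, b_i) per observation, where
    a_i is the mean distance to the own cluster (excluding the point itself)
    and b_i the smallest mean distance to any other cluster."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    ks = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        a = D[i, own].sum() / max(own.sum() - 1, 1)
        b = min(D[i, labels == c].mean() for c in ks if c != labels[i])
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s

# Two well-separated blobs give a clearly positive average silhouette.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)
s = silhouette_scores(X, labels)
```

On overlapping market-segmentation data the same computation yields much lower, sometimes negative, values, which is the pattern visible in the silhouette plots.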
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{silhouette_comparison.pdf}
\caption{Comparison of silhouette plots of two $k$-means cluster solutions of the Austrian vacation activities dataset with $k$=6. (A) shows the silhouette plot of the $k$-means solution of the full dataset and (B) the silhouette plot of the $k$-means solution of the dataset after manual feature selection. We can see that the cluster solution with the reduced dataset achieved better silhouette scores and that clusters 1 and 3 contain observations with negative silhouette scores.}
\label{fig:silhouette_comparison}
\end{figure}
We can further explore the similarities between the clusters by initializing an \\ \texttt{interactive\_tour()} with a 2D tour based on the linear discriminant analysis (LDA) projection pursuit index. By navigating through the tour, we can observe various projections, and when a projection that separates the clusters is found, we highlight each cluster sequentially. The different highlighted clusters can be seen in Figure \ref{fig:winter_activ_cluster_highlights}.
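The LDA projection pursuit index rewards projections that separate labeled groups. Its optimum can be sketched directly as the leading discriminant directions; the following is a simplified stand-in for the guided tour optimization, on synthetic data:

```python
import numpy as np

def lda_projection(X, labels, d=2):
    """Leading d discriminant directions, i.e. eigenvectors of Sw^{-1} Sb;
    these maximize between-group relative to within-group spread."""
    mu = X.mean(axis=0)
    p = X.shape[1]
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    # small ridge keeps Sw invertible for near-degenerate data
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-8 * np.eye(p), Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:d]]

rng = np.random.default_rng(3)
group_means = [[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]]
X = np.vstack([rng.normal(m, 0.5, size=(40, 4)) for m in group_means])
labels = np.repeat([0, 1, 2], 40)
P = lda_projection(X, labels)   # 4 x 2 projection matrix
Y = X @ P                       # 2D view separating the three groups
```

A guided tour interpolates smoothly towards such a target projection rather than jumping to it, but the end point plays the same role as \texttt{P} here.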
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{winter_activ_cluster_highlights.pdf}
\caption{Display of a projection of the Austrian vacation activities dataset with the six clusters of the $k$-means cluster solution highlighted in different colors. Projection axes with a norm $<0.1$ are not shown as these axes have little influence on the projection, but reduce clarity of the plot. The colors indicate which clusters the highlighted observations belong to with cluster 1 being blue (A), cluster 2 being orange (B), cluster 3 being green (C), cluster 4 being red (D), cluster 5 being violet (E) and cluster 6 being brown (F). Some data points always appear highlighted, which occurs due to the overlap of many data points in one spot. We can see that the projection separates some clusters well, but also that there is considerable overlap of clusters 4, 5 and 6.}
\label{fig:winter_activ_cluster_highlights}
\end{figure}
This process allows us to visually assess the separation and similarities between the clusters, providing insight into the structure of the dataset. By highlighting each cluster individually, we can evaluate their distinctiveness in different projections. The most influential features shown in Figure \ref{fig:winter_activ_cluster_highlights} are pool/sauna, alpine skiing, museums, going to the spa, going for walks and sightseeing. The projection roughly separates clusters 1 (blue), 2 (orange), and 3 (green) from each other and the other three clusters (red, violet and brown), which appear to be quite similar in the selected projection.
By manually manipulating the projection axes or initiating a local tour, we can gain further insight into the similarities between the different clusters. This interactive exploration allows for a more nuanced understanding of the relationships between clusters and the influence of key features on the separation of the data.
\subsubsection{Redefining cluster assignments -- learning more about museum goers}
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{winter_cl7_init.pdf}
\caption{Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Austrian vacation activities dataset. Projection axes with a norm $<0.1$ are not shown as these axes have little influence on the projection, but reduce clarity of the plot. Top left: 2D tour with the linear discriminant analysis projection pursuit index. Top right: heatmap with the intra-cluster fraction. Bottom left: 1D tour with the linear discriminant analysis projection pursuit index. Bottom right: mosaic plot. Tourists in both clusters 1 and 3 didn't participate in skiing a lot, but tourists in cluster 3 were much more interested in going to the pool, spa and health facilities compared to cluster 1.}
\label{fig:winter_cl7_init}
\end{figure}
There are several reasons why we might want to manually modify a clustering solution. One is to capture observations that do not fit well within their assigned clusters. Another reason is to explore specific features in more detail. The advantage of manual cluster selection is that it preserves most of the original clustering structure, allowing us to adjust specific parts of the solution without starting from scratch. This approach is particularly useful when we already have a cluster solution that reveals interesting patterns in the data.
In Figure \ref{fig:silhouette_comparison}B, we observed that clusters 1 and 3 of the reduced dataset contained data points that did not fit well into their respective clusters. To further investigate this, we can initialize an \texttt{interactive\_tour()} with the following components:
\vspace{2cm}
\begin{itemize}
\item A 2D tour using the linear discriminant analysis projection pursuit index,
\item A heatmap showing the intra-cluster fraction,
\item A 1D tour with the linear discriminant analysis projection pursuit index, and
\item A mosaic plot.
\end{itemize}
This setup produces the display shown in Figure \ref{fig:winter_cl7_init}. In the heatmap (top right), it can be seen that both clusters 1 and 3 contain tourists that didn't go alpine skiing, and that the main difference between them is that tourists in cluster 3 enjoyed going to the spa and health facilities as well as going to the pool, while the ones in cluster 1 didn't. Other than that, the clusters were mostly similar. Now we might be interested in the clusters of tourists that enjoy going to museums. To investigate, we can adjust the projection axes of the relevant features so that they are elongated and point in different directions. We can see that there is indeed overlap between clusters 1 and 3, as shown in the 2D projection and heatmap in Figure \ref{fig:winter_cl7_pre}. As a next step we can reassign the overlapping section to a new cluster, cluster 7 (pink). By selecting the checkbox for cluster 7 and manually selecting the region of overlap, we can form a new cluster, which is visualized in Figure \ref{fig:winter_cl7_post}.
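Programmatically, such a manual reassignment boils down to relabeling a selected subset of observations. In this sketch the GUI's lasso selection is replaced by a simple rule on hypothetical binary features:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
labels = rng.integers(1, 7, size=n)     # six clusters, labeled 1..6
museums = rng.integers(0, 2, size=n)    # hypothetical binary activity features
spa = rng.integers(0, 2, size=n)

# A manual selection in the GUI yields a boolean mask over observations;
# here we mimic it with a rule: members of clusters 1 or 3 who visited museums.
selected = np.isin(labels, [1, 3]) & (museums == 1)

new_labels = labels.copy()
new_labels[selected] = 7                # reassign the selection to cluster 7
```

The rest of the partition is untouched, which is the key advantage over reclustering from scratch.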
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{winter_cl7_pre.pdf}
\caption{Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Austrian vacation activities dataset, with manually adjusted projections. Projection axes with a norm $<0.1$ are not shown as these axes have little influence on the projection, but reduce clarity of the plot. Top left: 2D tour with a manually adjusted projection. Top right: heatmap with the intra-cluster fraction. Bottom left: 1D tour with a manually adjusted projection. Bottom right: mosaic plot. Changing the projection axes of ``going to a spa'', ``museums'', ``sightseeing'' and ``using health facilities'' reveals the preferences and overlap of clusters 1 and 3.}
\label{fig:winter_cl7_pre}
\end{figure}
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{winter_cl7_post.pdf}
\caption{Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Austrian vacation activities dataset, after new clusters have been selected manually. Projection axes with a norm $<0.1$ are not shown as these axes have little influence on the projection, but reduce clarity of the plot. Top left: 2D tour with a manually adjusted projection. Top right: heatmap with the intra-cluster fraction. Bottom left: 1D tour with a manually adjusted projection. Bottom right: mosaic plot. We can see that now almost all museum-goers are in the new manually selected cluster 7 and what their preferences are compared to the other clusters.}
\label{fig:winter_cl7_post}
\end{figure}
In Figure \ref{fig:winter_cl7_post}, we can observe slight behavioral differences between tourists in clusters 1 (blue) and 7 (pink). Both the 2D projection and the heatmap indicate that tourists in cluster 7 all enjoyed both museums and sightseeing, whereas most tourists in cluster 1 engaged in sightseeing but showed no interest in museums. The heatmap in Figure \ref{fig:winter_cl7_post} further shows that participants in cluster 1 exhibited a greater preference for hiking. Despite this, tourists in both clusters generally shared similar interests. This insight could be valuable for enhancing museum marketing strategies. While clusters 1 and 7 have overlapping interests, it appears that current marketing efforts may not effectively reach tourists in cluster 1. By increasing targeted marketing at hiking trails, popular excursion destinations, and shopping centers, it may be possible to attract more interest in museums from tourists in cluster 1.
\subsection{Australian Vacation Activities dataset}
The second dataset, the Australian Vacation Activities dataset, includes responses from 1,003 adult Australians who were surveyed through a permission-based internet panel. The survey was conducted in 2007. Participants were asked whether they engaged in 44 specific vacation activities during their most recent vacation within Australia. Similar to the Austrian dataset, responses were binarized: a value of 1 indicates that the participant took part in the activity, while a value of 0 signifies they did not. Surveys in which participants claimed to have partaken in more than 40 activities, or in no activity at all, were removed as faulty.
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{aus_feature_clustering.pdf}
\caption{Dendrogram of the features of the Australian Vacation Activities dataset using the Ward2 algorithm with the Jaccard index. Features that were clustered together are marked by the colored boxes. We can see which activities had similar patterns of tourist interest.}
\label{fig:aus_feature_clustering}
\end{figure}
\subsubsection{Feature selection}
First, hierarchical clustering using the Ward2 algorithm \citep{murtagh2014ward} with the Jaccard index was applied to the features. The resulting dendrogram is shown in Figure \ref{fig:aus_feature_clustering}. Based on this clustering, \( k = 15 \) clusters were identified, and generally only one informative representative feature from each cluster was selected for further analysis. Clusters containing unpopular activities, such as ``Adventure'', which only had 42 participants, were discarded. The cluster containing popular features like ``Beach'', ``Swimming'', ``ScenicWalks'', ``Markets'', ``Sightseeing'', ``Friends'', ``Pubs'', ``BBQ'', ``Shopping'', ``Eating'', ``EatingHigh'', ``Movies'', and ``Relaxing'' was treated differently: multiple features from this cluster were retained to preserve as much information as possible. After feature selection, the observations were clustered using $k$-means with \( k = 6 \).
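A rough Python analogue of this feature clustering could look as follows; it uses simulated binary data, and note that \texttt{scipy}'s Ward linkage on a precomputed distance matrix only approximates \proglang{R}'s \texttt{ward.D2}:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 44))   # stand-in for the 44 binary activities

# Jaccard distance between features (columns): 1 - |A and B| / |A or B|.
B = X.T.astype(bool)                     # 44 x 300: one row per feature
inter = (B[:, None, :] & B[None, :, :]).sum(axis=2)
union = (B[:, None, :] | B[None, :, :]).sum(axis=2)
D = 1.0 - inter / np.maximum(union, 1)
np.fill_diagonal(D, 0.0)

# Ward linkage on the precomputed Jaccard distances, cut into 15 clusters.
Z = linkage(squareform(D, checks=False), method="ward")
feature_clusters = fcluster(Z, t=15, criterion="maxclust")
```

Each feature cluster can then be reduced to one representative column before reclustering the observations.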
\subsubsection{Segmentation of tourists travelling without friends}
In the heatmap displaying the intra-cluster fraction shown in Figure \ref{fig:aus_preselection} (bottom right), we observe that tourists in clusters 1, 2, and 6 tend to prefer traveling without friends. We can assume that this group consists of tourists traveling alone, as couples, or with their families. Additional features to further subsegment these groups might be interesting for future surveys. As a tourist agency, we might be interested in targeting these travelers more effectively. However, clusters 2 and 6 also include individuals who enjoy spending time with their friends. To further explore the dataset, we can launch an \texttt{interactive\_tour()} with the same configuration as in the previous example. To achieve better separation between the clusters, we can skip to the last frames of a 2D tour optimized for the LDA index.
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{aus_preselection.pdf}
\caption{Interactive tour GUI loaded with multiple plots showing different aspects of the $k$-means solution of the Australian vacation activities dataset, with the projection axis of the feature ``Friends'' pointing in one direction and all other axes in another. Top left: 2D tour. Top right: 1D tour. Bottom left: Mosaic plot. Bottom right: Heatmap with the intra-cluster fraction. We can see how the feature ``Friends'' and the general activity level (represented by the other projection axes pointing in one direction) separate the data.}
\label{fig:aus_preselection}
\end{figure}
Since we are particularly interested in tourists who did not spend time with friends, we can extend the projection axis ``Friends'' outward to separate the data based on that feature. Data points in the opposite direction of the ``Friends'' axis are the tourists we are interested in, which can be seen on the left side of the top left plot in Figure \ref{fig:aus_preselection}. Subsequently, we can pull all other projection axes in one direction to separate data points based on their overall activity level. The resulting projection is shown in Figure \ref{fig:aus_preselection} (top left). Tourists that fall in the direction of the axes generally engage in more activities compared to those in the opposite direction. This separation is evident, as observations in cluster 6 (brown) are located opposite the axes, and the heatmap (Figure \ref{fig:aus_preselection}, bottom right) shows that they did not engage in many activities. Similarly, we can see that cluster 3 (green), which contains quite active tourists, is shifted towards the direction of the axes.
Using this logic, we can manually segment the data points on the left into three new clusters: active tourists (more upward - cluster 7), moderately active tourists (center - cluster 8), and largely inactive tourists (bottom - cluster 9). The result of this segmentation can be seen in Figure \ref{fig:aus_selection}. Analyzing the heatmap (bottom right), we can identify several interesting patterns.
By comparing cluster 3 (green), which contains active tourists who spent time with their friends, with cluster 7 (pink), the active tourists who traveled without their friends, we notice that cluster 7 showed less interest in visiting the casino, theatre, and chartering a boat.
We also observe distinctive differences in the interests of the clusters that traveled without friends. According to the heatmap on the bottom right of Figure \ref{fig:aus_selection}, cluster 7 was interested in relaxing, shopping, sightseeing, wildlife, going to the beach, visiting pubs, and exploring farms. Given that cluster 7 showed a notable interest in going to pubs, we can assume that the solo travelers and couples in this cluster are looking to meet new people. This insight could be leveraged in a marketing campaign by bundling activities that appeal to cluster 7, creating packages tailored to these tourists. Such packages would provide opportunities for them to connect with other travelers who share similar interests while enjoying their preferred activities. Additionally, some solo travelers and couples might prefer experiences or environments without children, allowing for the creation of adult-oriented packages. The remaining subgroup within cluster 7 consists of families. To better cater to this group, packages can be refined for instance to consider that not all museums or festivals are equally suitable for children, enabling more family-friendly customizations.
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{aus_selection.pdf}
\caption{Interactive tour GUI as seen in \ref{fig:aus_preselection}, but after sub-selection of clusters 7 (pink), 8 (grey) and 9 (gold). Top left: 2D tour. Top right: 1D tour. Bottom left: Mosaic plot. Bottom right: Heatmap with the intra-cluster fraction. We can observe the preferences of three new subsegments, which can be interpreted as very active (pink), moderately active (grey) and inactive (gold) tourists traveling without their friends.}
\label{fig:aus_selection}
\end{figure}
Although cluster 8 (grey) was generally less active, almost everyone still engaged in sightseeing. We can see in the heatmap on the bottom right of Figure \ref{fig:aus_selection} that for them the focus was on sightseeing, relaxing, shopping, and going to the beach, with much less interest in other activities. It can be assumed that solo travelers in this group value their time alone. This subgroup could potentially be targeted more effectively by offering sightseeing options with minimal interaction, such as using a phone app to provide information about interesting locations, rather than relying on a tour guide. For couples in this cluster, the focus may be on enjoying quality time together in a relaxed, low-key environment. Marketing strategies could emphasize romantic sightseeing experiences, such as sunset tours or private beach spots. Additionally, offering couples' packages that include spa treatments, leisurely dining experiences, and personalized itineraries could resonate well with this group, allowing them to unwind and connect at their own pace. Families in cluster 8 might prioritize activities that balance relaxation and light exploration, particularly in child-friendly settings. Sightseeing tours tailored to families, with engaging and educational content for children, could be a strong fit. Beach outings that offer safe, family-oriented areas or interactive experiences like sandcastle-building competitions could also be appealing. By curating packages that cater to both relaxation and gentle family activities, families can enjoy their vacation with minimal stress.
Finally, since cluster 9 (gold) was notably inactive, one might infer that they prefer spending much of their time in their accommodation. Consequently, these tourists might be most interested in accommodations that offer well-equipped, comfortable living spaces with amenities that cater to relaxation and leisure.
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{final_projection_risk.pdf}
\caption{Final frame of a guided tour of the clustered risk data with the LDA index. The color of the data points indicates cluster membership: cluster 1 is blue, cluster 2 orange, cluster 3 green, cluster 4 red, and cluster 5 violet. Except for Health, all axes point in one direction, indicating that the projection primarily separates the groups based on their general risk-taking behavior. Most clusters are well separated, but clusters 1 and 4 overlap.}
\label{fig:final_projection_risk}
\end{figure}
\subsection{Tourist Risk-taking dataset}
The final dataset to be analyzed stems from a survey of 563 Australian residents who undertook a holiday of at least four nights in 2015. The respondents were asked about the types of risks they had taken in the past. Six different types of risk were screened: recreational (e.g., rock-climbing, scuba diving), health (e.g., smoking, poor diet, high alcohol consumption), career (e.g., quitting a job without another to go to), financial (e.g., gambling, risky investments), safety (e.g., speeding), and social risks (e.g., standing for election, publicly challenging a rule or decision). The response options for each risk type were on a Likert scale ranging from 1 to 5: never (1), rarely (2), quite often (3), often (4), and very often (5).
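Coding such verbal responses on the 1--5 scale is straightforward; the following is a minimal sketch in \proglang{Python} (the column names and toy responses are illustrative, only the response-to-score mapping comes from the survey design):

```python
import pandas as pd

# Mapping of the verbal Likert responses to the 1-5 scale used in the survey.
LIKERT = {"never": 1, "rarely": 2, "quite often": 3, "often": 4, "very often": 5}

# Hypothetical raw answers for two of the six risk types.
responses = pd.DataFrame({
    "health":    ["never", "often", "rarely"],
    "financial": ["very often", "never", "quite often"],
})

# Replace each verbal response with its numeric score, column by column.
coded = responses.apply(lambda col: col.map(LIKERT))
```

After this step every risk variable is an integer column ready for clustering.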
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{risk-lda-regroup.pdf}
\caption{Interface of the ``Guided Tour - LDA - Regroup'' function. The option highlighted by the yellow box indicates which element of the dropdown menu to select to spawn the displayed interface. Within the interface, the user can choose how they want the LDA metric to be computed by activating switches. All clusters selected in one column will be considered as a single cluster when computing the LDA metric. This affects only the calculation of the metric and the resulting projection matrices, not the actual selections. If a switch in the ``Ignore'' column is activated, the data points within that cluster will not be considered for the generation of the guided tour.}
\label{fig:risk-lda-regroup}
\end{figure}
To analyze the data, we first removed duplicate observations and then performed a $k$-means clustering with $k=5$. As the dataset comprises only six features, no feature selection was performed. Subsequently, a 2D tour with the linear discriminant analysis (LDA) index was conducted, and the results were visualized using the \texttt{lionfish} package. The final projection of the tour can be seen in Figure \ref{fig:final_projection_risk}. In this projection, all axes generally point in the same direction, indicating that the projection primarily separates the data based on general risk-taking behavior. Cluster 5 (violet) is identified as the most risk-averse, while cluster 3 (green) is the most risk-taking. Additionally, cluster 2 (orange) is oriented towards the health axis, suggesting that this cluster is more inclined to take health risks. Clusters 1 (blue) and 4 (red) overlap significantly in the shown projection.
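The preprocessing steps above can be sketched as follows, assuming the six risk variables are held in a pandas data frame (the toy data and the function name are illustrative, not part of \texttt{lionfish}):

```python
import pandas as pd
from sklearn.cluster import KMeans

def cluster_risk_data(df, k=5, seed=0):
    """Drop duplicate survey responses, then partition the
    remaining observations into k clusters with k-means."""
    deduped = df.drop_duplicates()
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(deduped)
    return deduped, labels

# Toy stand-in for the six Likert-scaled risk columns,
# containing several duplicated response patterns.
toy = pd.DataFrame({
    "recreational": [1, 1, 5, 5, 3, 3, 1],
    "health":       [2, 2, 4, 4, 3, 3, 2],
    "career":       [1, 1, 5, 5, 2, 2, 1],
    "financial":    [1, 1, 4, 4, 3, 3, 1],
    "safety":       [2, 2, 5, 5, 3, 3, 2],
    "social":       [1, 1, 4, 4, 2, 2, 1],
})
deduped, labels = cluster_risk_data(toy, k=3)
```

The deduplicated data and cluster labels would then be handed to the guided tour and the \texttt{lionfish} GUI.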
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{regrouped_risk.pdf}
\caption{Final frame of the guided tour generated based on the settings shown in Figure \ref{fig:risk-lda-regroup}. Clusters 1 (blue) and 4 (red) are now much better separated compared to the projection shown in Figure \ref{fig:final_projection_risk}. The most influential features within the data for differentiating between the clusters were career risk and recreational risk, with the former being higher for cluster 1 and the latter for cluster 4.}
\label{fig:regrouped_risk}
\end{figure}
Given the overlap between clusters 1 and 4 in Figure \ref{fig:final_projection_risk}, we are interested in exploring their differences further. To achieve this, we generate a new tour with the LDA index from within the GUI, utilizing the built-in function to redefine the groupings considered in the LDA. This was done by selecting ``Guided Tour - LDA - Regroup'' from the dropdown menu and then clicking the ``Run tour'' button, which spawns an interface for the regrouping. By activating the switches for clusters 2, 3, and 5 in the ``Ignore'' column of the interface, the tour ignores the data from these clusters and focuses on separating clusters 1 and 4. This interface is shown in Figure \ref{fig:risk-lda-regroup}. Alternatively, we could activate the switches for clusters 2, 3, and 5 not in the ``Ignore'' column but in another column, e.g., the one labeled ``New subgroup 2''. The tour would then treat these clusters as a single cluster and essentially try to separate clusters 1 and 4 from each other and from all other data points. The final projection of the new tour can be seen in Figure \ref{fig:regrouped_risk}. In this projection, it is evident that the largest differences between the two clusters are in career and recreational risk-taking.
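Conceptually, the ``Ignore'' option restricts the LDA computation to the remaining clusters. The idea can be sketched outside the GUI with scikit-learn; the data below are synthetic and the setup is illustrative, not the \texttt{lionfish} implementation:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Toy stand-in: two moderately risk-taking clusters that differ mainly
# in career vs. recreational risk, plus one risk-averse cluster.
# Feature order: career, recreational, then four further risk types.
c1 = rng.normal([3, 1, 3, 3, 3, 3], 0.3, size=(50, 6))  # higher career risk
c4 = rng.normal([1, 3, 3, 3, 3, 3], 0.3, size=(50, 6))  # higher recreational risk
c5 = rng.normal([1, 1, 1, 1, 1, 1], 0.3, size=(50, 6))  # risk-averse, to be ignored
X = np.vstack([c1, c4, c5])
y = np.array([1] * 50 + [4] * 50 + [5] * 50)

# "Ignore" all clusters other than 1 and 4, as in the regroup interface,
# and fit a one-dimensional discriminant direction to the rest.
keep = np.isin(y, [1, 4])
lda = LinearDiscriminantAnalysis(n_components=1).fit(X[keep], y[keep])
direction = lda.scalings_[:, 0]
```

In this toy setup the resulting direction loads almost entirely on the first two features, mirroring how the regrouped tour isolates career and recreational risk as the separating variables.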
\section{Discussion}\label{discussion}
One might question the necessity of manual exploratory data analysis, considering that it is inherently subjective and relies heavily on intuition. However, in situations like those presented here, where there is no clearly defined or optimal clustering solution or feature selection, manual exploration becomes indispensable. While it is possible to optimize clustering metrics to improve separation between clusters, this alone may not yield conclusions that are useful for practical applications. The lack of clear boundaries and the overlapping nature of clusters in the datasets underscore the limitations of purely automated methods in capturing the complexity and nuance of real-world data.
In such cases, manual exploration allows analysts to interweave expert knowledge, intuition, and specific objectives with the initial clustering solution, which serves as the backbone of the analysis. This approach is particularly valuable when no single solution can be deemed “correct''. For instance, in the Austrian dataset, leveraging the interactive GUI for feature selection enabled us to isolate the most informative activities and gain deeper insights into the preferences of tourists who might be interested in visiting museums. This understanding facilitated targeted recommendations for increasing museum attendance by focusing on tourists frequenting hiking trails, excursion spots, and shopping centers. Such insights are challenging to extract through automated optimization alone.
Similarly, in the Australian dataset, manual exploration allowed us to refine the understanding of tourists traveling without friends by dividing them into three distinct subsegments: highly active, moderately active, and largely inactive tourists. This nuanced segmentation, derived from a blend of automated clustering and manual adjustments, offers a deeper understanding of the varied needs and behaviors within this group, enabling more effective marketing strategies.
The tourist risk-taking dataset was used to show how the interactive selection, together with the guided tour, can be used to fully understand the separation of all clusters in terms of the features. There is no single linear projection that can visualize the separation between all clusters, but we can use the interactivity of \texttt{lionfish} to understand the separation in two steps. First, we find a projection that can separate most clusters, and in a second step we select those clusters that were overlapping and ask for a projection that can best separate those groups.
While some plots supported by the \texttt{lionfish} package are specifically designed for analyzing binary survey data, it is important to emphasize that its capabilities extend far beyond this data type. The versatility of tours has been demonstrated across various applications in the past, showcasing their effectiveness in exploring complex, high-dimensional datasets. The interactive GUI enables seamless exploration of both 1D and 2D tours, regardless of the data being analyzed, providing a powerful tool for uncovering patterns and insights across diverse domains.
\section{Conclusion}\label{conclusion}
Ultimately, manual data exploration serves as a useful complement to automated methods, providing the flexibility to incorporate context, expert judgment, and specific analytical goals. This approach enables analysts to refine initial results and adapt them to the complexities of real-world scenarios, leading to more nuanced interpretations and actionable insights. The \texttt{lionfish} package offers an organized and responsive interactive tool for conducting such analyses, bridging the gap between automated clustering and exploration.
By integrating interactive visualization capabilities, the \texttt{lionfish} package empowers users to dynamically engage with their data, making it possible to uncover subtle patterns and relationships that might otherwise remain hidden. This is especially valuable in tackling complex datasets with the mindset that an automated solution needs to be validated. The package has broad applicability across various data types and analytical contexts where clustering is used. The flexibility of setting up the GUI elements from the command line allows it to be tailored for different applications.
Future developments in \texttt{lionfish} might include expanding the range of visualization methods and offering additional interactive features, such as enabling the generation of new cluster solutions directly from the GUI. An obvious expansion will be to add support for numeric data to the heatmap interface as described in Section \ref{workflow}. Leveraging the integration of both \proglang{R} and \proglang{Python}, future enhancements could include a seamless combination of algorithms from each language, further broadening the software's potential for extension. For more complex data shapes, such as those with concavities or non-linear boundaries, the implementation of sliced tours~\citep{Laa2020} is recommended to improve exploratory analysis. Streamlining the addition of new plot types could also further enhance the versatility of the package, making it more adaptable for different data visualization needs. This ongoing integration of \proglang{R} and \proglang{Python} exemplifies how the strengths of both languages can be harnessed for the development of new software packages. A further expansion of this would be to develop a Shiny app \citep{shiny} leveraging the package's \proglang{Python} integration.
In summary, \texttt{lionfish} represents a significant advancement in the toolkit of data analysts, offering a novel way to balance automated analysis with human intuition and domain expertise, thereby facilitating a deeper and more comprehensive understanding of complex datasets. We have demonstrated its capabilities by analyzing three different datasets from the domain of market segmentation.
\section{Resources and supplementary materials}\label{resources}
The source code and documentation can be found at \href{https://mmedl94.github.io/lionfish/}{https://mmedl94.github.io/lionfish/}. The documentation features explanations and demonstrations of all implemented plot types as well as additional use cases not shown in this article. Supplementary materials, including:
\begin{itemize} \itemsep 0in
\item scripts to start the GUI and reproduce the graphics in the paper.
\item saved state for different parts of the interactive analyses, so that the results can be reproduced.
\end{itemize}
can be found at \href{https://github.com/mmedl94/lionfish_article/}{https://github.com/mmedl94/lionfish\_article/}.
\section{Acknowledgements}\label{acknowledgements}
The development of the \texttt{lionfish} package has been supported by Google Summer of Code 2024.
%\bibliographystyle{plainat}
\bibliography{lionfish_references}
\end{document}