index.Rmd

---
title: "Supplementary material for \"Incorporating statistical thinking into visualisation practices for decision-making in operational management\""
author:
  - name: Emi Tanaka
    affiliation: Department of Econometrics and Business Statistics, Monash University, Melbourne, VIC 3800
    email: emi.tanaka@monash.edu
  - name: Jessica Wai Yin Leung
    affiliation: Department of Econometrics and Business Statistics, Monash University, Melbourne, VIC 3800
    email: jessica.leung@monash.edu
  - name: Dianne Cook
    affiliation: Department of Econometrics and Business Statistics, Monash University, Melbourne, VIC 3800
    email: dicook@monash.edu
bibliography: references.bib 
output:
  bookdown::html_document2:
    code_folding: "hide"
    theme: "paper"
    code_download: true
    number_sections: false
---


This is the supplementary material to an invited commentary for @basole2021. We provide all code that are used to generate the figures in the commentary in addition to other supplementary figures (and its code). 


<ul class="fa-ul">
  <li><span class="fa-li"><i class="fas fa-code"></i></span> To see the code, click on the CODE button. </li>
  <li><span class="fa-li"><i class="fas fa-download"></i></span> You can also download the whole R Markdown file from the drop down menu on the top right corner.</li>
  <li><span class="fa-li"><i class="fab fa-github"></i></span> The GitHub repo for this material can be found at https://github.com/emitanaka/supp-visOM. </li>
</ul>

**List of figures** 

* [Figure S1](#fig:mimic-original): Recreating Figure 3 of @basole2021 using `ggplot2`. 
* [Figure S2](#fig:fig3-alt): An alternative design to Figure 3 of @basole2021.
* [Figure S3](#fig:lineup-theirs): Lineup for the tile grid plot used in Figure 3 of @basole2021.
* [Figure S4](#fig:lineup-ours): Lineup for the scatter plot. 
* [Figure S5](#fig:lineup-theirs-false): Lineup for the tile grid plot on data with purposely high association.
* [Figure S6](#fig:lineup-ours-false): Lineup for the scatter plot on data with purposely high association.

<details>
<summary>Code to load library and data</summary>
```{r setup, message = FALSE, warning = FALSE, class.source = 'fold-show'}
library(tidyverse)
library(ggtext)
library(patchwork)
library(readxl)
library(nullabor)
library(here)
library(janitor)
library(scales)
```

```{r knitr-setup, include = FALSE}
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
knitr::opts_chunk$set(fig.path = "images/",
                      dev = c("png", "pdf", "svg"),
                      cache = FALSE,
                      cache.path = "cache/",
                      message = FALSE,
                      warning = FALSE)
options(knitr.duplicate.label = "allow")
```


```{r data, class.source = 'fold-show'}
df_full <- read_xlsx(here("data/MaskedCoverage-Fig3.xlsx")) %>% 
  clean_names() %>% 
  add_row(state = c("OR", "WY", "SD", "WV", "DC", "AL")) %>% 
  mutate(row = case_when(
    state %in% c("ME") ~ 1L,
    state %in% c("VT", "NH") ~ 2L,
    state %in% c("WA", "ID", "MT", "ND", "MN", "IL", "WI", "MI", "NY", "RI", "MA") ~ 3L,
    state %in% c("OR", "NV", "WY", "SD", "IA", "IN", "OH", "PA", "NJ", "CT") ~ 4L,
    state %in% c("CA", "UT", "CO", "NE", "MO", "KY", "WV", "VA", "MD", "DE") ~ 5L,
    state %in% c("AZ", "NM", "KS", "AR", "TN", "NC", "SC", "DC") ~ 6L,
    state %in% c("OK", "LA", "MS", "AL", "GA") ~ 7L,
    state %in% c("TX", "FL") ~ 8L,
                         TRUE ~ 0L),
    col = case_when(
      state %in% c("WA", "OR", "CA") ~ 1L,
      state %in% c("ID", "NV", "UT", "AZ") ~ 2L,
      state %in% c("MT", "WY", "CO", "NM") ~ 3L,
      state %in% c("ND", "SD", "NE", "KS", "OK", "TX") ~ 4L,
      state %in% c("MN", "IA", "MO", "AR", "LA") ~ 5L,
      state %in% c("IL", "IN", "KY", "TN", "MS") ~ 6L,
      state %in% c("WI", "OH", "WV", "NC", "AL") ~ 7L,
      state %in% c("MI", "PA", "VA", "SC", "GA") ~ 8L,
      state %in% c("NY", "NJ", "MD", "DC", "FL") ~ 9L,
      state %in% c("VT", "RI", "CT", "DE") ~ 10L,
      state %in% c("ME", "NH", "MA") ~ 11L,
      TRUE ~ 0L
    ))

df_miss <- df_full %>% 
  filter(!is.na(readmission_rate))
```
</details>

(ref:mimicary) This figure recreates Figure 3 in @basole2021 using the `ggplot2` R-package [@ggplot2]. The code is displayed above by clicking on the CODE button just above the right corner of this plot.

```{r mimic-original, fig.height = 8, fig.width = 18, fig.cap = "(ref:mimicary)"}           
g1 <- df_miss %>% 
  mutate(y = readmission_rate * 100) %>% 
  ggplot(aes(col, row)) +
  geom_point(aes(size = coverage_obscured, color = y), alpha = 0.8) +
  geom_text(aes(label = percent(y/100, 0.01)), nudge_y = -0.1, size = 2.5) +
  labs(color = "Readmission Rate", size = "Coverage") +
  scale_color_gradient2(low = "#3F6E9A", high = "#AB4C30", midpoint = median(df_miss$readmission_rate * 100), mid = "#ffffe0") +
  theme_void() +
  geom_text(data = df_full, aes(label = state), color = "black", nudge_y = 0.05) +
  scale_size(range = c(3, 30)) +
  scale_y_reverse() +
  theme(plot.margin = margin(r = 30))

g2 <- g1 %+% mutate(df_miss, y = colorectal_cancer_screenings) + 
  scale_color_gradient2(low = "#3F6E9A", high = "#AB4C30", midpoint = median(df_miss$colorectal_cancer_screenings), mid = "#ffffe0") +
  labs(color = "Cancer Screening Rate")

g1 + g2 + plot_layout(guides = "collect") 
```

(ref:fig3-alt) The above figure show an alternative plot design to display the information in [Figure S1](#fig:mimic-original) and is the same figure as Figure 1 in the main paper. The plot shows a scatter plot of percentage of readmission and coverage on the left and a scatter plot of percentage of cancer screening and coverage on the right. Both plots are superimposed by a local polynomial regression (displayed as a blue line) with confidence interval for the line (displayed in gray). The code is displayed above by clicking on the CODE button just above the right corner of this plot.
 
```{r fig3-alt, fig.height = 4, fig.width = 8, fig.cap = "(ref:fig3-alt)"}
theme_set(theme_classic())
g1 <- ggplot(df_miss, aes(coverage_obscured * 100, readmission_rate * 100)) +
  geom_point() +
  labs(x = "Coverage (%)", y = "Readmission (%)") +
  geom_smooth(method = loess, formula = y ~ x) +
  annotate("richtext", x = 80, y = 15.3, label.color = NA, fill = "transparent", label = glue::glue("R<sup>2</sup> = {scales::comma(cor(df_miss$coverage_obscured, df_miss$readmission_rate)^2, 0.001)}")) 

g2 <- ggplot(df_miss, aes(coverage_obscured * 100, colorectal_cancer_screenings)) +
  geom_point() +
  labs(x = "Coverage (%)", y = "Cancer Screening (%)") +
  geom_smooth(method = loess, formula = y ~ x) +
  annotate("richtext", x = 80, y = 73, label.color = NA, fill = "transparent", label = glue::glue("R<sup>2</sup> = {scales::comma(cor(df_miss$coverage_obscured, df_miss$colorectal_cancer_screenings)^2, 0.001)}")) 


g1 + g2 
```

```{r lineup-data}
set.seed(2021)
lineup_data <- null_permute("colorectal_cancer_screenings") %>% 
  lineup(true = df_miss, n = 20, pos = 3)
```

(ref:lineup-theirs) The above figure shows a lineup for this tile grid plot where one of the plots is made using the data, and the other nineteen plots are constructed after first permuting the percentage of cancer screening across different states with the missing value structure is preserved. Text and legends have been removed to minimise the bias in reading plots due to the reader being aware of the context. **Which plot strikes the most different to you?**  The code is displayed above by clicking on the CODE button just above the right corner of this plot. Try the other lineups before finding the data plot position at the bottom of this document.

```{r lineup-theirs, fig.height = 10, fig.width = 10, fig.cap = "(ref:lineup-theirs)"}
plot_lineup_theirs <- ggplot(lineup_data, aes(col, row)) +
  geom_point(aes(size = coverage_obscured, color = colorectal_cancer_screenings), alpha = 0.8) +
  theme_void() + 
  scale_color_gradient2(low = "#3F6E9A", high = "#AB4C30", midpoint = median(df_miss$colorectal_cancer_screenings), mid = "#ffffe0") +
  scale_size(range = c(1, 5)) +
  scale_y_reverse(expand = c(0.1, 0.2))  +
  guides(color = "none", size = "none") + 
  facet_wrap(~.sample, ncol = 5) +
  scale_x_continuous(expand = c(0.1, 0.1)) + 
  theme(legend.position = "bottom",
        strip.text = element_text(size = 18, margin = margin(t = 3, b = 3)),
        strip.background = element_rect(color = "black", size = 1.5))

plot_lineup_theirs
```

(ref:lineup-ours) The above figure shows a lineup for the scatter plot design used in [Figure S2](#fig:fig3-alt). The same data used to create [Figure S3](#fig:lineup-theirs) (including the null data) is used to create this lineup. The code is displayed above by clicking on the CODE button just above the right corner of this plot. When you are ready, find the position of the data plot is revealed at the [bottom of this document](#position).

```{r lineup-ours, fig.height = 10, fig.width = 10, fig.cap = "(ref:lineup-ours)"}
plot_lineup_ours <- ggplot(lineup_data, aes(coverage_obscured * 100, colorectal_cancer_screenings)) +
  geom_point() +  
  geom_smooth(method = loess, formula = y ~ x) +
  facet_wrap(~.sample, ncol = 5) +
  scale_x_continuous(expand = c(0.1, 0.1)) + 
  theme(legend.position = "bottom",
        strip.text = element_text(size = 18, margin = margin(t = 3, b = 3)),
        strip.background = element_rect(color = "black", size = 1.5),
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.line = element_blank(),
        axis.ticks.length = unit(0, "pt"))

plot_lineup_ours
```


# Same plots with higher associations between variables

The following are plots based on data that purposely modifies cancer screening to induce a higher association with the coverage. This higher association is induced (as shown in the code below) by rearranging data by the coverage and modifying the cancer screening percentage so that it is ordered from low to high. 

```{r data-false, class.source="fold-show"}
df_false <- df_miss %>% 
  arrange(coverage_obscured) %>% 
  mutate(colorectal_cancer_screenings = sort(colorectal_cancer_screenings))

lineup_false_data <- null_permute("colorectal_cancer_screenings") %>% 
  lineup(true = df_false, n = 20, pos = 5)
```

(ref:lineup-theirs-false) The above shows a lineup for the tile grid plot design where the data was purposely manipulated to induce a higher association between the variables of interest. Which plot looks the most strikingly different to you? Try the next lineup to see if you can find the data plot before finding the position of the data plot [here](#position).

```{r lineup-theirs-false, fig.height = 10, fig.width = 10, fig.cap = "(ref:lineup-theirs-false)"}
plot_lineup_theirs %+% lineup_false_data
```

(ref:lineup-ours-false) The above shows a lineup for the scatter plot design with the data that was purposely manipulated so that two variables mapped to the $x$-axis and $y$-axis have a higher association. How easy was it to spot the data plot compared to [Figure S5](#fig:lineup-theirs-false)? You can find the code to generate the above lineup plot by collapsing all codes. 

```{r lineup-ours-false, fig.height = 10, fig.width = 10, fig.cap = "(ref:lineup-ours-false)"}
plot_lineup_ours %+% lineup_false_data
```

# Positions of the data plot {#position}

The positions of the data plot for the lineup are as follows:

* [Figure S3](#fig:lineup-theirs): position 3. 
* [Figure S4](#fig:lineup-ours): position 3. 
* [Figure S5](#fig:lineup-theirs-false): position 5. 
* [Figure S6](#fig:lineup-ours-false): position 5. 

We expect that you would have struggled to find the data plto in [Figure S3](#fig:lineup-theirs) and [Figure S4](#fig:lineup-ours) as we do not observe strong association between the cancer screening rate and coverage. Additionally, we expect that most would spot the data plot in [Figure S5](#fig:lineup-theirs-false) and all would spot the data plot in [Figure S6](#fig:lineup-ours-false). For those that spot the data plot in [Figure S5](#fig:lineup-theirs-false), we suspect it took longer than spotting the data plot in [Figure S6](#fig:lineup-ours-false). 

# Acknowledgement 

We thank @basole2021 for supplying us the synthetic data to draw the above plots. 

# Reference

<details>
<summary>Session Information</summary>
```{r session-info}
sessioninfo::session_info()
```
</details>

  
<br>


```{r copy-for-paper, include = FALSE}
# this chunk must be the last one
fs::file_copy("images/fig3-alt-1.pdf", "paper/", overwrite = TRUE)
fs::file_copy("images/lineup-theirs-1.pdf", "paper/", overwrite = TRUE)
# extract all the R code 
knitr::purl("index.Rmd",
            documentation = 1)
fs::file_move("index.R", "code/plots.R")
# remove this chunk and the knitr setup from the output R code
f <- readLines("code/plots.R")
i1 <- which(str_detect(f, "^## ----knitr-setup"))
chunk_indexes <- which(str_detect(f, "^## ----"))
j1 <- chunk_indexes[chunk_indexes > i1][1] - 1
  
i2 <- max(which(str_detect(f, "^## ----copy-for-paper")))
j2 <- length(f)
out <- f[-c(i1:j1, i2:j2)]
writeLines(out, "code/plots.R")
```