---
title: "Assertive Testing"
abstract: ""
format:
  html:
    code-line-numbers: true
    fig-align: "center"
editor_options:
  chunk_output_type: console
bibliography: references.bib
---
![A multiple-choice test](images/Exams_Start..._Now.jpg)
~ Photo by [Ryan McGilchrist](https://en.wikipedia.org/wiki/Multiple_choice#/media/File:Exams_Start..._Now.jpg)
```{r hidden-here-load}
#| include: false
exercise_number <- 1
```
```{r}
#| echo: false
#| warning: false
library(tidyverse)
library(gt)
source("src/motivation.R")
```
```{r}
#| label: tbl-roadmap
#| tbl-cap: "Opinionated Analysis Development"
#| echo: false
motivation |>
  filter(!is.na(Section), Section == "Programming") |>
  select(-`Analysis Feature`) |>
  arrange(Section) |>
  gt() |>
  tab_header(
    title = "Opinionated Analysis Development"
  ) |>
  tab_footnote(
    footnote = "Added by Aaron R. Williams",
    locations = cells_column_labels(columns = c(Tool, Section))
  ) |>
  tab_source_note(
    source_note = md("**Source:** Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.")
  )
```
## Assertive Testing of Data
> While reproducibility drastically reduces the number of errors and opacity of analysis, without assertive testing it runs the risk of applying an analysis to corrupted data, or applying an analysis to data that have drifted too far from assumptions. ~ [@parker]
Assertions are useful for verifying the quality of data. Many of the principles from assertions and unit testing for functions apply:
- Fail fast, fail often
- Fail loudly
- Fail clearly
Assertive testing of data and assumptions is often much squishier than the unit testing and assertions from the previous section. We must now rely on subject matter expertise and experience with the data to develop assertions that can catch corruptions of the data or data processing mistakes.
> Assertive testing means establishing these quality-control checks – usually based on past knowledge of possible corruptions of the data – and halting an analysis if the quality-control checks are not passed, so the analyst can investigate and hopefully fix (or at least account for) the problem. ~ [@parker]
### `library(assertr)`
[`library(assertr)`](https://docs.ropensci.org/assertr/) is a framework for applying assertions to data frames in R. It works well with the pipe (`%>%` or `|>`) because the first argument of the five main functions is always a data frame.
::: {.callout-tip}
## Predicate Function
A predicate function is a function that returns a single `TRUE` or `FALSE`.
:::
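For example, a minimal, hypothetical predicate function (not part of `library(assertr)`) might look like this:

```{r}
# returns TRUE or FALSE for each value it is given
is_positive <- function(x) {
  x > 0
}

is_positive(3)
```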
`verify()` takes a logical expression. If all values are `TRUE` for the logical expression, the code proceeds. If any value is `FALSE`, the code terminates and returns a diagnostic tibble.
```{r}
library(assertr)

msleep %>%
  verify(nrow(.) == 83) |>
  verify(sleep_total < 24) |>
  verify(has_class("sleep_total", class = "numeric"))
```
```{r}
#| eval: false
msleep %>%
  verify(nrow(.) == 82) |>
  verify(sleep_total < 14) |>
  verify(has_class("sleep_total", class = "character"))
```
```
verification [nrow(.) == 82] failed! (1 failure)

    verb redux_fn     predicate column index value
1 verify       NA nrow(.) == 82     NA     1    NA

Error: assertr stopped execution
```
`assert()` takes a predicate function and an arbitrary number of variables. `assert()` will terminate if any value violates the predicate function, and it can apply the same test to multiple variables at once.
```{r}
msleep %>%
  assert(within_bounds(0, 24), c(sleep_total, sleep_rem, sleep_cycle))
```
`insist()` is like `assert()`, but `insist()` can make assertions based on the observed data (e.g., throw an error if any value exceeds four sample standard deviations from the sample mean).
```{r}
msleep %>%
  insist(within_n_sds(n = 3), sleep_total)
```
`assert_rows()` extends `assert()` so the assertion can rely on values from multiple columns (e.g., row means must fall within a bound or each row must have a certain number of non-missing values).
```{r}
msleep |>
  assert_rows(num_row_NAs, within_bounds(0, 5), everything())
```
`insist_rows()` extends `insist()` so the assertion can rely on values from multiple columns. This is less common but can be used to check whether any observation exceeds a certain Mahalanobis distance from the other rows.
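As a hedged sketch of that idea, the `maha_dist()` row reduction function from `library(assertr)` can be paired with `insist_rows()`. `mtcars` is used here because it is small and entirely numeric, and the 10-MAD threshold is only illustrative:

```{r}
# flag any car whose combination of measurements is extremely unusual
# (Mahalanobis distance more than 10 median absolute deviations out)
mtcars |>
  insist_rows(maha_dist, within_n_mads(10), everything())
```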
The predicate functions that ship with `library(assertr)` include:

- `verify()` predicate functions
  - `has_all_names()`
  - `has_only_names()`
  - `has_class()`
- `assert()` predicate functions
  - `not_na()`
  - `within_bounds()`
  - `in_set()`
  - `is_uniq()`
- `insist()` predicate functions
  - `within_n_sds()`
  - `within_n_mads()`
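As a hedged sketch, several of these can be combined in a single chain. The specific expectations below (every animal has a unique, non-missing name; `vore` takes one of four known values) are assumptions about `msleep`, not rules supplied by `library(assertr)`:

```{r}
msleep |>
  # the columns we rely on must exist
  verify(has_all_names("name", "vore", "sleep_total")) |>
  # every animal has a non-missing, unique name
  assert(not_na, name) |>
  assert(is_uniq, name) |>
  # vore is one of the known diet categories (NAs pass by default)
  assert(in_set("carni", "herbi", "insecti", "omni"), vore)
```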
:::callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Add a new code chunk to `analysis.qmd`.
2. Run `glimpse(trees)`.
3. `verify()` that the variable `Girth` is numeric.
4. `assert()` that all three variables are in the interval $[0, \infty)$.
:::
[This vignette](https://docs.ropensci.org/assertr/) demonstrates additional functionality.
`library(assertr)` is designed to be used early in a workflow. If you want to run the assertions at the end of the workflow and you don't want to see printed tibble after printed tibble, end the chain of code with the following custom function.
```
#' Helper function to silence output from testing code
#'
#' @param data A data frame
#'
quiet <- function(data) {
  quiet <- data
}
```
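For example, a hedged sketch of how `quiet()` might close out a chain of checks (not evaluated here, since `quiet()` is only defined in the snippet above):

```
msleep %>%
  verify(nrow(.) == 83) |>
  assert(not_na, name) |>
  quiet()
```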
Example: [Boosting Upward Mobility from Poverty](https://github.com/UI-Research/mobility-from-poverty/blob/version2024/10_construct-database/11_construct_county_all.qmd)
### Other Assertions
[`library(tidylog)`](https://cran.r-project.org/web/packages/tidylog/readme/README.html) prints diagnostic information when functions from `library(dplyr)` and `library(tidyr)` are used.
```{r}
#| message: false
library(tidylog)
```
```{r}
math_scores <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Bart", 97,
  "Carrie", 100
)

reading_scores <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Bart", 67,
  "Carrie", 100,
  "Zeta", 100
)

left_join(x = math_scores, y = reading_scores, by = "name")

full_join(x = math_scores, y = reading_scores, by = "name")
```
We'll detach tidylog to keep the rest of this document clean.
```{r}
detach("package:tidylog", unload = TRUE)
```
::: {.callout-note}
`library(tidylog)` is excellent for interactive development of data analyses.
If you look at `library(tidylog)` output *more than once*, then write an assertion to capture the same information.
:::
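As a hedged sketch, the information `library(tidylog)` printed about the left join above, that every student in `math_scores` found a match in `reading_scores` and no rows were duplicated, can be captured with assertions. The specific checks are assumptions about these data:

```{r}
scores <- left_join(x = math_scores, y = reading_scores, by = "name")

# the join should not duplicate or drop any rows from math_scores
stopifnot(nrow(scores) == nrow(math_scores))

# every student with a math score should have a matched reading score
stopifnot(!any(is.na(scores$reading_score)))
```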
#### Missing Values
The following throws an error if the data set contains any missing values.
```{r}
missing_values <- map_dbl(.x = trees, ~sum(is.na(.x)))
stopifnot(sum(missing_values) == 0)
```
#### Joins
Joins are one of the most dangerous parts of any data analysis. We can think of many different types of joins:
- "one-to-one"
- "one-to-many"
- "many-to-one"
- "many-to-many"
We can provide an expectation for the type of join using the `relationship` argument in `*_join()` functions. This is an assertion.
Consider the test scores data sets from earlier. This should be a one-to-one join because each row in `x` matches at most 1 row in `y` and each row in `y` matches at most 1 row in `x`.
```{r}
math_scores <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Bart", 97,
  "Carrie", 100
)

reading_scores <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Bart", 67,
  "Carrie", 100,
  "Zeta", 100
)

left_join(
  x = math_scores,
  y = reading_scores,
  by = "name",
  relationship = "one-to-one"
)
```
Suppose there were two `"Alec"` rows in either data set. Then this code would throw a loud error, as sketched below.
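Here is a hedged sketch of that failure mode. The second `"Alec"` row is hypothetical, and the chunk is not evaluated because it errors by design:

```{r}
#| eval: false

# add a hypothetical second "Alec" to the math scores
math_scores_dup <- bind_rows(
  math_scores,
  tibble(name = "Alec", math_score = 91)
)

# "Alec" in reading_scores now matches two rows in math_scores_dup,
# so the "one-to-one" assertion fails with a loud error
left_join(
  x = math_scores_dup,
  y = reading_scores,
  by = "name",
  relationship = "one-to-one"
)
```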
#### Pivots
Pivots are also one of the most dangerous parts of any data analysis. We can write tests for the number of rows and the classes of the columns in the output of a pivot.
Consider `table4a` from `library(tidyr)`.
```{r}
table4a
```
We want to pivot this data set to be longer because the data set isn't [tidy](https://r4ds.had.co.nz/tidy-data.html). Before writing code to tidy the data, we can probably come up with a few assertions:
- There should be six rows.
- `year` and `cases` should be numeric.
```{r}
table4a_tidy <- table4a |>
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  ) |>
  mutate(year = as.numeric(year))

stopifnot(nrow(table4a_tidy) == 6)
stopifnot(class(pull(table4a_tidy, year)) == "numeric")
stopifnot(class(pull(table4a_tidy, cases)) == "numeric")
```
It's easy to get tired and to cut corners. Assertions never rest.
> Understand, that your assertion is out there. It can't be bargained with. It can't be reasoned with. It doesn't feel pity or remorse or fear. It absolutely will not stop ever until your analysis is correct. ~ [Terminator (sort of)](https://www.youtube.com/watch?v=zu0rP2VWLWw)
## Assertive Testing of Assumptions
Assertive testing of assumptions is the squishiest of everything we've considered testing. We don't want to apply an analysis to data that have drifted too far from the assumptions of the analysis. We also don't want to inappropriately apply a set of binary tests (think mechanical null hypothesis testing with p-values).
At the very least, we should include visualizations and diagnostic tests that systematically explore the assumptions of an analysis in our Quarto documents. Then, we can use version control to track if anything changed unexpectedly.
Beyond that, we need to rely on subject matter expertise to come up with heuristics for assertions.
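As a hedged sketch, a subject-matter heuristic can be turned into an assertion. The bounds below are invented for illustration, not established facts about mammal sleep:

```{r}
# based on (hypothetical) past experience, average total sleep across
# species should fall somewhere between 8 and 13 hours
msleep |>
  verify(between(mean(sleep_total), 8, 13))
```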
## Profiling and Benchmarking
We skipped the question "If you are not using efficient code, will you be able to identify it?"
Human time is expensive. Machine time is cheap. All else equal, we shouldn't worry too much about making our code more efficient.
Sometimes, it is necessary to make our code more efficient. After all, who cares if our analysis is reproducible if it takes two weeks to run?
::: {.callout-tip}
## Profiling
Profiling is the systematic measurement of the run-time of each line of code.
:::
::: {.callout-tip}
## Benchmarking
Benchmarking is the precise measurement of the performance of a small piece of code. Typically, the code is run multiple times to improve the precision of the measurement.
:::
Systematically making code more efficient generally proceeds in three steps:
- Step 1: Profile the entire set of code to identify bottlenecks.
- Step 2: Benchmark small pieces of code that are responsible for the bottleneck.
- Step 3: Try to improve the slow pieces of code. Return to step 2 to evaluate the result.
RStudio has built-in tools for profiling the run time and memory usage of large chunks of code. See [this section](https://adv-r.hadley.nz/perf-measure.html#profiling) of Advanced R to learn more.
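RStudio's profiler is built on `library(profvis)`, which can also be called directly. Here is a hedged sketch (not evaluated; it assumes `library(profvis)` is installed and that the wrapped code runs long enough to be sampled):

```{r}
#| eval: false

library(profvis)

# profile each line of a small block of work to find the bottleneck
profvis({
  data <- matrix(runif(5e6), ncol = 100)
  column_medians <- apply(data, 2, median)
  centered <- sweep(data, 2, column_medians)
})
```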
`library(microbenchmark)` has robust tools for benchmarking code. See [this section](https://adv-r.hadley.nz/perf-measure.html#microbenchmarking) of Advanced R to learn more.
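A hedged sketch of a benchmark (not evaluated; it assumes `library(microbenchmark)` is installed, and the two expressions are placeholders for the candidates identified during profiling):

```{r}
#| eval: false

library(microbenchmark)

x <- runif(1e5)

# compare two equivalent ways to compute a square root
microbenchmark(
  sqrt_function = sqrt(x),
  power_operator = x^0.5,
  times = 100
)
```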