---
title: "Assertive Testing"
abstract: ""
format:
  html:
    code-line-numbers: true
    fig-align: "center"
editor_options:
  chunk_output_type: console
bibliography: references.bib
---
![A multiple-choice test](images/Exams_Start..._Now.jpg)
~ Photo by [Ryan McGilchrist](https://en.wikipedia.org/wiki/Multiple_choice#/media/File:Exams_Start..._Now.jpg)
```{r hidden-here-load}
#| include: false
exercise_number <- 1
```
```{r}
#| echo: false
#| warning: false
library(tidyverse)
library(gt)
source("src/motivation.R")
```
```{r}
#| label: tbl-roadmap
#| tbl-cap: "Opinionated Analysis Development"
#| echo: false
motivation |>
  filter(!is.na(Section), Section == "Programming") |>
  select(-`Analysis Feature`) |>
  arrange(Section) |>
  gt() |>
  tab_header(
    title = "Opinionated Analysis Development"
  ) |>
  tab_footnote(
    footnote = "Added by Aaron R. Williams",
    locations = cells_column_labels(columns = c(Tool, Section))
  ) |>
  tab_source_note(
    source_note = md("**Source:** Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.")
  )
```
## Assertive Testing of Data
> While reproducibility drastically reduces the number of errors and opacity of analysis, without assertive testing it runs the risk of applying an analysis to corrupted data, or applying an analysis to data that have drifted too far from assumptions. ~ [@parker]
Assertions are useful for verifying the quality of data. Many of the principles from assertions and unit testing for functions apply:
- Fail fast, fail often
- Fail loudly
- Fail clearly
Assertive testing of data and assumptions is often much squishier than the unit testing and assertions from the previous section. We must now rely on subject matter expertise and experience with the data to develop assertions that can catch corruptions of the data or data processing mistakes.
> Assertive testing means establishing these quality-control checks – usually based on past knowledge of possible corruptions of the data – and halting an analysis if the quality-control checks are not passed, so the analyst can investigate and hopefully fix (or at least account for) the problem. ~ [@parker]
### `library(assertr)`
[`library(assertr)`](https://docs.ropensci.org/assertr/) is a framework for applying assertions to data frames in R. It works well with the pipe (`%>%` or `|>`) because the first argument of the five main functions is always a data frame.
::: {.callout-tip}
## Predicate Function
A predicate function is a function that returns a single `TRUE` or `FALSE`.
:::
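For example, a minimal, hypothetical predicate function (not part of `library(assertr)`) might look like this:

```{r}
# returns TRUE or FALSE for each value it is given
is_positive <- function(x) {
  x > 0
}

is_positive(3)
```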
`verify()` takes a logical expression. If all values are `TRUE` for the logical expression, the code proceeds. If any value is `FALSE`, the code terminates and returns a diagnostic tibble.
```{r}
library(assertr)

msleep %>%
  verify(nrow(.) == 83) |>
  verify(sleep_total < 24) |>
  verify(has_class("sleep_total", class = "numeric"))
```
```{r}
#| eval: false
msleep %>%
  verify(nrow(.) == 82) |>
  verify(sleep_total < 14) |>
  verify(has_class("sleep_total", class = "character"))
```
```
verification [nrow(.) == 82] failed! (1 failure)

    verb redux_fn     predicate column index value
1 verify       NA nrow(.) == 82     NA     1    NA

Error: assertr stopped execution
```
`assert()` takes a predicate function and an arbitrary number of variables. `assert()` will terminate if any value violates the predicate function, and it can apply the same test to multiple variables at once.
```{r}
msleep %>%
  assert(within_bounds(0, 24), c(sleep_total, sleep_rem, sleep_cycle))
```
`insist()` is like `assert()`, but `insist()` can make assertions based on the observed data (e.g., throw an error if any value exceeds four sample standard deviations from the sample mean).
```{r}
msleep %>%
  insist(within_n_sds(n = 3), sleep_total)
```
`assert_rows()` extends `assert()` so the assertion can rely on values from multiple columns (e.g., row means must fall within a bound or each row must have a certain number of non-missing values).
```{r}
msleep |>
  assert_rows(num_row_NAs, within_bounds(0, 5), everything())
```
`insist_rows()` extends `insist()` so the assertion can rely on values from multiple columns. This is less common but can be used to check whether any observation exceeds a certain Mahalanobis distance from the other rows.
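As a hedged sketch of that idea, the `maha_dist()` row reduction function from `library(assertr)` can be paired with `insist_rows()`. `mtcars` is used here because it is small and entirely numeric, and the 10-MAD threshold is only illustrative:

```{r}
# flag any car whose combination of measurements is extremely unusual
# (Mahalanobis distance more than 10 median absolute deviations out)
mtcars |>
  insist_rows(maha_dist, within_n_mads(10), everything())
```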
The predicate functions that ship with `library(assertr)` include:

- `verify()` predicate functions
  - `has_all_names()`
  - `has_only_names()`
  - `has_class()`
- `assert()` predicate functions
  - `not_na()`
  - `within_bounds()`
  - `in_set()`
  - `is_uniq()`
- `insist()` predicate functions
  - `within_n_sds()`
  - `within_n_mads()`
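As a hedged sketch, several of these can be combined in a single chain. The specific expectations below (every animal has a unique, non-missing name; `vore` takes one of four known values) are assumptions about `msleep`, not rules supplied by `library(assertr)`:

```{r}
msleep |>
  # the columns we rely on must exist
  verify(has_all_names("name", "vore", "sleep_total")) |>
  # every animal has a non-missing, unique name
  assert(not_na, name) |>
  assert(is_uniq, name) |>
  # vore is one of the known diet categories (NAs pass by default)
  assert(in_set("carni", "herbi", "insecti", "omni"), vore)
```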
:::callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Add a new code chunk to `analysis.qmd`.
2. Run `glimpse(trees)`.
3. `verify()` that the variable `Girth` is numeric.
4. `assert()` that all three variables are in the interval $[0, \infty)$.
:::
[This vignette](https://docs.ropensci.org/assertr/) demonstrates additional functionality.
`library(assertr)` is designed to be used early in a workflow. If you want to run the assertions at the end of the workflow and you don't want to see printed tibble after printed tibble, end the chain of code with the following custom function.
```
#' Helper function to silence output from testing code
#'
#' @param data A data frame
#'
quiet <- function(data) {
  quiet <- data
}
```
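For example, a hedged sketch of how `quiet()` might close out a chain of checks (not evaluated here, since `quiet()` is only defined in the snippet above):

```
msleep %>%
  verify(nrow(.) == 83) |>
  assert(not_na, name) |>
  quiet()
```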
Example: [Boosting Upward Mobility from Poverty](https://github.com/UI-Research/mobility-from-poverty/blob/version2024/10_construct-database/11_construct_county_all.qmd)
### Other Assertions
[`library(tidylog)`](https://cran.r-project.org/web/packages/tidylog/readme/README.html) prints diagnostic information when functions from `library(dplyr)` and `library(tidyr)` are used.
```{r}
#| message: false
library(tidylog)
```
```{r}
math_scores <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Bart", 97,
  "Carrie", 100
)

reading_scores <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Bart", 67,
  "Carrie", 100,
  "Zeta", 100
)

left_join(x = math_scores, y = reading_scores, by = "name")

full_join(x = math_scores, y = reading_scores, by = "name")
```
We'll detach tidylog to keep the rest of this document clean.
```{r}
detach("package:tidylog", unload = TRUE)
```
::: {.callout-note}
`library(tidylog)` is excellent for interactive development of data analyses.
If you look at `library(tidylog)` output *more than once*, then write an assertion to capture the same information.
:::
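As a hedged sketch, the information `library(tidylog)` printed about the left join above, that every student in `math_scores` found a match in `reading_scores` and no rows were duplicated, can be captured with assertions. The specific checks are assumptions about these data:

```{r}
scores <- left_join(x = math_scores, y = reading_scores, by = "name")

# the join should not duplicate or drop any rows from math_scores
stopifnot(nrow(scores) == nrow(math_scores))

# every student with a math score should have a matched reading score
stopifnot(!any(is.na(scores$reading_score)))
```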
#### Missing Values
The following throws an error if the data set contains any missing values.
```{r}
missing_values <- map_dbl(.x = trees, ~sum(is.na(.x)))
stopifnot(sum(missing_values) == 0)
```
#### Joins
Joins are one of the most dangerous parts of any data analysis. We can think of many different types of joins:
- "one-to-one"
- "one-to-many"
- "many-to-one"
- "many-to-many"
We can provide an expectation for the type of join using the `relationship` argument in `*_join()` functions. This is an assertion.
Consider the test scores data sets from earlier. This should be a one-to-one join because each row in `x` matches at most 1 row in `y` and each row in `y` matches at most 1 row in `x`.
```{r}
math_scores <- tribble(
  ~name, ~math_score,
  "Alec", 95,
  "Bart", 97,
  "Carrie", 100
)

reading_scores <- tribble(
  ~name, ~reading_score,
  "Alec", 88,
  "Bart", 67,
  "Carrie", 100,
  "Zeta", 100
)

left_join(
  x = math_scores,
  y = reading_scores,
  by = "name",
  relationship = "one-to-one"
)
```
Suppose there were two `"Alec"` rows in either data set. Then this code would throw a loud error, as sketched below.
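Here is a hedged sketch of that failure mode. The second `"Alec"` row is hypothetical, and the chunk is not evaluated because it errors by design:

```{r}
#| eval: false

# add a hypothetical second "Alec" to the math scores
math_scores_dup <- bind_rows(
  math_scores,
  tibble(name = "Alec", math_score = 91)
)

# "Alec" in reading_scores now matches two rows in math_scores_dup,
# so the "one-to-one" assertion fails with a loud error
left_join(
  x = math_scores_dup,
  y = reading_scores,
  by = "name",
  relationship = "one-to-one"
)
```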
#### Pivots
Pivots are also one of the most dangerous parts of any data analysis. We can write tests for the number of rows and the classes of the columns in the output of a pivot.
Consider `table4a` from `library(tidyr)`.
```{r}
table4a
```
We want to pivot this data set to be longer because the data set isn't [tidy](https://r4ds.had.co.nz/tidy-data.html). Before writing code to tidy the data, we can probably come up with a few assertions:
- There should be six rows.
- `year` and `cases` should be numeric.
```{r}
table4a_tidy <- table4a |>
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  ) |>
  mutate(year = as.numeric(year))

stopifnot(nrow(table4a_tidy) == 6)
stopifnot(class(pull(table4a_tidy, year)) == "numeric")
stopifnot(class(pull(table4a_tidy, cases)) == "numeric")
```
It's easy to get tired and to cut corners. Assertions never rest.
> Understand, that your assertion is out there. It can't be bargained with. It can't be reasoned with. It doesn't feel pity or remorse or fear. It absolutely will not stop ever until your analysis is correct. ~ [Terminator (sort of)](https://www.youtube.com/watch?v=zu0rP2VWLWw)
## Assertive Testing of Assumptions
Assertive testing of assumptions is the squishiest of everything we've considered testing. We don't want to apply an analysis to data that have drifted too far from the assumptions of the analysis. We also don't want to inappropriately apply a set of binary tests (think mechanical null hypothesis testing with p-values).
At the very least, we should include visualizations and diagnostic tests that systematically explore the assumptions of an analysis in our Quarto documents. Then, we can use version control to track if anything changed unexpectedly.
Beyond that, we need to rely on subject matter expertise to come up with heuristics for assertions.
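As a hedged sketch, a subject-matter heuristic can be turned into an assertion. The bounds below are invented for illustration, not established facts about mammal sleep:

```{r}
# based on (hypothetical) past experience, average total sleep across
# species should fall somewhere between 8 and 13 hours
msleep |>
  verify(between(mean(sleep_total), 8, 13))
```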
## Profiling and Benchmarking
We skipped the question "If you are not using efficient code, will you be able to identify it?"
Human time is expensive. Machine time is cheap. All else equal, we shouldn't worry too much about making our code more efficient.
Sometimes, it is necessary to make our code more efficient. After all, who cares if our analysis is reproducible if it takes two weeks to run?
::: {.callout-tip}
## Profiling
Profiling is the systematic measurement of the run-time of each line of code.
:::
::: {.callout-tip}
## Benchmarking
Benchmarking is the precise measurement of the performance of a small piece of code. Typically, the code is run multiple times to improve the precision of the measurement.
:::
Systematically making code more efficient generally proceeds in three steps:
- Step 1: Profile the entire set of code to identify bottlenecks.
- Step 2: Benchmark small pieces of code that are responsible for the bottleneck.
- Step 3: Try to improve the slow pieces of code. Return to step 2 to evaluate the result.
RStudio has built-in tools for profiling the run time and memory usage of large chunks of code. See [this section](https://adv-r.hadley.nz/perf-measure.html#profiling) of Advanced R to learn more.
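RStudio's profiler is built on `library(profvis)`, which can also be called directly. Here is a hedged sketch (not evaluated; it assumes `library(profvis)` is installed and that the wrapped code runs long enough to be sampled):

```{r}
#| eval: false

library(profvis)

# profile each line of a small block of work to find the bottleneck
profvis({
  data <- matrix(runif(5e6), ncol = 100)
  column_medians <- apply(data, 2, median)
  centered <- sweep(data, 2, column_medians)
})
```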
`library(microbenchmark)` has robust tools for benchmarking code. See [this section](https://adv-r.hadley.nz/perf-measure.html#microbenchmarking) of Advanced R to learn more.
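A hedged sketch of a benchmark (not evaluated; it assumes `library(microbenchmark)` is installed, and the two expressions are placeholders for the candidates identified during profiling):

```{r}
#| eval: false

library(microbenchmark)

x <- runif(1e5)

# compare two equivalent ways to compute a square root
microbenchmark(
  sqrt_function = sqrt(x),
  power_operator = x^0.5,
  times = 100
)
```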