Use the simplest data type reasonable. A simpler data type is less likely contain unintended values. As we have seen, a string variable called gender
can simultaneously contain the values "m", "f", "F", "Female", "MALE", "0", "1", "2", "Latino", "", and NA
. On the other hand, a boolean variable gender_male
can be only FALSE
, TRUE
, and NA
.^[The equivalent of R's logical
data type is called a bit
in SQL Server, and a boolean
in Postgres and MySQL. SQLite's integer
is best alternative for boolean variables.]
SQLite does not have a dedicated datatype, so you must resort to storing it as 0
, 1
and NULL
values. Because a caller can't assume that an ostensible boolean SQLite variable contains only those three values, the variable should be checked.
Once you have cleaned a variable in your initial ETL files (like an Ellis), establish boundaries so you do not have to spend time in the downstream files verifying that no bad values have been introduced. As a small bonus, simpler data types are typically faster, consume less memory, and translate more cleanly across platforms.
Within R, numeric-ish variables can be represented by the following four data types. Use the simplest type that adequately captures the information. logical
is the simplest and numeric
is the most flexible.
logical
/boolean/bit,integer
,bit64::integer64
, andnumeric
/double-precision floats.
Categorical variables have a similar spectrum. After logical
types, factor
s are more restrictive and less flexible than character
s.^[In the database world, character
variables are typically represented by the varchar
; when setting the maximum length, consider padding with some extra character. If most of the entries are eight to ten characters, consider using varchar(15)
. Mimicking a factor
variable is more complicated. Factor levels are typically defined in a dedicated table (commonly called a lookup table), and then referenced with a foreign key relationship. In many circumstances saving an R factor to a database, we will use a simple varchar
.]
logical
/boolean/bit,factor
, andcharacter
.
When a boolean variable would be too restrictive and a factor or character is required, choose the simplest representation. Where possible:
- Use only lower case (e.g., 'male' instead of 'Male' for the
gender
variable). - avoid repeating the variable in the level (e.g., 'control' instead of 'control condition' for the
condition
variable).
Almost every project recodes variables. Choose the simplest function possible. The functions at the top are easier to read and harder to mess up than the functions below it
-
Leverage existing booleans: Suppose you have the logical variable
gender_male
(which can be onlyTRUE
,FALSE
, orNA
). Writinggender_male == TRUE
orgender_male == FALSE
will evaluate to a boolean --that's unnecessary becausegender_male
is already a boolean.-
Testing for
TRUE
: use the variable by itself (i.e.,gender_male
instead ofgender_male == TRUE
). -
Testing for
FALSE
: use!
. Write!gender_male
instead ofgender_male == FALSE
orgender_male != TRUE
.
-
-
dplyr::coalesce()
: The function evaluates a single variable and replacesNA
with values from another variable.A coalesce like
visit_completed <- dplyr::coalesce(visit_completed, FALSE)
is much easier to read and not mess up than
visit_completed <- dplyr::if_else(!is.na(visit_completed), visit_completed, FALSE)
-
dplyr::na_if()
transforms a nonmissing value into an NA.Recoding missing values like
birth_apgar <- dplyr::na_if(birth_apgar, 99)
is easier to read and not mess up than
birth_apgar <- dplyr::if_else(birth_apgar == 99, NA_real_, birth_apgar)
-
<=
(or a similar comparison operator): Compare two quantities to output a boolean variable. The parentheses are unnecessary, but can help readability. If either value isNA
, then the result isNA
.Notice that we prefer to order the variables like a number line. When the result is
TRUE
, the smaller value is to the left of the larger value.dob_in_the_future <- (Sys.Date() < dob) dod_follows_dob <- (dob <= dod) premature <- (gestation_weeks < 37) big_boy <- (threshold_in_kg <= birth_weight_in_kg)
-
dplyr::if_else()
: The function evaluates a single boolean variable or expression. The output branches to only three possibilities: the input is (a) true, (b) false, or (c) (optionally)NA
. Notice that unlike the<=
operator,dplyr::if_else()
lets you specify a value if the input expression evaluates toNA
.date_start <- as.Date("2017-01-01") # If a missing month element needs to be handled explicitly. stage <- dplyr::if_else(date_start <= month, "post", "pre", missing = "missing-month") # Otherwise a simple boolean output is sufficient. stage_post <- (date_start <= month)
If it is important that the reader understand that an input expression of
NA
will produce an NA, consider usingdplyr::if_else()
. Even though these two lines are equivalent, a casual reader may not consider thatstage_post
could beNA
.stage_post <- (date_start <= month) stage_post <- dplyr::if_else(date_start <= month, TRUE, FALSE, missing = NA)
-
dplyr::between()
: The function evaluates a numericx
against aleft
and aright
boundary to return a boolean value. The output isTRUE
ifx
is inside the boundaries or equal to either boundary (i.e., the boundaries are inclusive). The output isFALSE
ifx
is outside either boundary.too_cold <- 60 too_hot <- 88 goldilocks_1 <- dplyr::between(temperature, too_cold, too_hot) # This is equivalent to the previous line. goldilocks_2 <- (too_cold <= temperature & temperature <= too_hot)
If you need an exclusive boundary, abandon
dplyr::between()
and specify it exactly.# Left boundary is exclusive goldilocks_3 <- (too_cold < temperature & temperature <= too_hot) # Both boundaries are exclusive goldilocks_4 <- (too_cold < temperature & temperature < too_hot)
If your code starts to nest
dplyr::between()
calls insidedplyr::if_else()
, considerbase::cut()
. -
base::cut()
: The function transforms a single numeric variable into a factor. Its range is cut into different segments/categories on the one-dimensional number line. The output branches to single discrete value (either a factor-level or an integer). Modify theright
parameter toFALSE
if you'd like the left/lower bound to be inclusive (which tends to be more natural for me).mtcars |> tibble::as_tibble() |> dplyr::select( disp, ) |> dplyr::mutate( # Example of a simple inequality operator (see two bullets above) muscle_car = (300 <= disp), # Divide `disp` into three levels. size_default_labels = cut(disp, breaks = c(-Inf, 200, 300, Inf), right = F), # Divide `disp` into three levels with custom labels. size_cut3 = cut( disp, breaks = c(-Inf, 200, 300, Inf), labels = c( "small", "medium", "big"), right = FALSE # Is the right boundary INclusive ('FALSE' is an EXclusive boundary) ), # Divide `disp` into five levels with custom labels. size_cut5 = cut( disp, breaks = c(-Inf, 100, 150, 200, 300, Inf), labels = c( "small small", "medium small", "biggie small", "medium", "big"), right = FALSE ), )
-
dplyr::recode()
: The function accepts an integer or character variable. The output branches to a single discrete value. This example maps integers to strings.# https://www.census.gov/quickfacts/fact/note/US/RHI625219 race_id <- c(1L, 2L, 1L, 4L, 3L, 4L, 2L, NA_integer_) race_id_spouse <- c(1L, 1L, 2L, 3L, 3L, 4L, 5L, NA_integer_) race <- dplyr::recode( race_id, "1" = "White", "2" = "Black or African American", "3" = "American Indian and Alaska Native", "4" = "Asian", "5" = "Native Hawaiian or Other Pacific Islander", .missing = "Unknown" )
If multiple variables have the same mapping, define the mapping once in a named vector, and pass it for multiple calls to
dplyr::recode()
. Notice that the two variablesrace
andrace_spouse
use the same mapping.^[For now, employ the!!!
operator without understanding it. When you're more comfortable with R, read about quosures and lazy evaluation so you can use it in more general scenarios.]mapping_race <- c( "1" = "White", "2" = "Black or African American", "3" = "American Indian and Alaska Native", "4" = "Asian", "5" = "Native Hawaiian or Other Pacific Islander" ) race <- dplyr::recode( race_id, !!!mapping_race, .missing = "Unknown" ) race_spouse <- dplyr::recode( race_id_spouse, !!!mapping_race, .missing = "Unknown" )
Tips for
dplyr::recode()
:- A reusable dedicated mapping vector is very useful for surveys with 10+ Likert items with consistent levels like "disagree", "neutral", "agree".
- Use
dplyr::recode_factor()
to map integers to factor levels. forcats::fct_recode()
is similar. We prefer the.missing
parameter ofdplyr::recode()
that translates anNA
into an explicit value.- When using the REDCap API, these functions help convert radio buttons to a character or factor variable.
-
lookup table: It is feasible to recode 6 levels of race directly in R, but it's less feasible to recode 200 provider names. Specify the mapping in a csv, then use readr to convert the csv to a data.frame, and finally left join it.
-
dplyr::case_when()
: The function is the most complicated because it can evaluate multiple input variables. Also, multiple cases can be true, but only the first output is returned. This 'water fall' execution helps in complicated scenarios, but is overkill for most.
Try to prepend each function with its package. Write dplyr::filter()
instead of filter()
. When two packages contain public functions with the same name, the package that was most recently called with library()
takes precedent. When multiple R files are executed, the packages' precedents may not be predictable. Specifying the package eliminates the ambiguity, while also making the code easier to follow. For this reason, we recommend that almost all R files contain a 'load-packages' chunk.
See the Google Style Guide for more about qualifying functions.
Some exceptions exist, including:
- The sf package if you're using its objects with dplyr verbs.
Don't use the minus operator (i.e., -
) to subtract dates. Instead use as.integer(difftime(stop, start, units="days"))
. It's longer but protects from the scenario that start
or stop
are changed upstream from a date to a datetime. In that case, stop - start
returns the number of seconds between the two points, not the number of days.
Some variables are critical to the record, and if it's missing, you don't want or trust any of its other values. For instance, a hospital visit record rarely useful with a null patient ID. In these cases, prevent the record from passing through the ellis.
In this example, we'll presume we cannot trust a patient record if it lacks a clean date of birth (dob
).
-
Define the permissible range, in either the ellis's declare-globals chunk, or in the config-file. (We'll use the config file for this example.) We'll exclude anyone born before 2000, or after tomorrow. Even though it's illogical for someone in a retrospective record to be born tomorrow, consider bending a little for small errors.
range_dob : !expr c(as.Date("2000-01-01"), Sys.Date() + lubridate::days(1))
-
In the tweak-data chunk, use
OuhscMunge::trim_date()
to set the cell toNA
if it falls outside an acceptable range. Afterdplyr::mutate()
, calltidyr::drop_na()
to exclude the entire record, regardless if (a) it was alreadyNA
, or (b) was "trimmed" toNA
.ds <- ds |> dplyr::mutate( dob = OuhscMunge::trim_date(dob, config$range_dob) ) |> tidyr::drop_na(dob)
-
Even though it's overkill after trimming, (eventually) verify the variable for three reasons: (a) there's a chance that the code above isn't working as expected, (b) later code might have introduced bad values, and (c) it clearly documents to a reader that
dob
was included in this range at this stage of the pipeline.checkmate::assert_date(ds$dob, any.missing=F, lower=config$range_dob[1], upper=config$range_dob[2])
The checkmate::assert_*()
functions will throw an error and stop R's execution when encountering a vector that violates the constraints you specified. The previous snippet will alert you if
ds$dob
is not a date,ds$dob
has at least oneNA
value, ords$dob
has value earlier thanconfig$range_dob[1]
or later thanconfig$range_dob[2]
.
The package has a family of functions that accommodate many types of vectors. Some common conditions to verify are:
-
the vector's values are unique, which arises when you're about to upload a primary key to a database (e.g., a patient ID to the patient table),
checkmate::assert_integer(ds$pt_id, unique = TRUE)
-
a vector's string should follow a strict pattern (e.g., the patient ID is a "A" or "B", followed by 4 digits)
checkmate::assert_character(ds$pt_id, pattern = "^[AB]\\d{4}$")
-
the database doesn't accept names longer than 50 characters
checkmate::assert_character(ds$name_first, min.chars = 50) # or checkmate::assert_character(ds$name_first, pattern = "^.{0,50}$")
The pattern
argument is ultimately passed to base::grepl()
, which leverage regular expressions.
Sometimes a dataset smells fishy even though no single cell violates a constraint. Send up a flare if it's kinda bad, yet stop the execution if it really stinks.
This is especially important for recurring scripts that process new datasets that are never inspected by a human, such as a daily forecast. Even though today's incoming dataset is fine, you shouldn't trust next month's. At worst, the lonely test never catches a violation (and you wasted 5 minutes). At best, it catches a problem that would have proceeded undetected and compromised your downstream analyses.
The following snippet asserts it's acceptable that 2% of patients are missing an age, but it should never get worse than 5%. Therefore it throws an error when the missingness exceeds 5% and it throws an warning if it exceeds 2%.
# Simulate a vector of ages.
ds <- tibble::tibble(
age = sample(c(NA, 1:19), size = 100, replace = TRUE)
)
# Define thresholds for errors & warnings.
threshold_error <- .05
threshold_warning <- .02
# Calculate proportion of missing cells.
missing_proportion <- mean(is.na(ds$age))
# Accompany the error/warning with an informative message.
if (threshold_error < missing_proportion) {
stop(
"The proportion of missing `age` values is ", missing_proportion,
", but it shouldn't exceed ", threshold_error, "."
)
} else if (threshold_warning < missing_proportion) {
warning(
"The proportion of missing `age` values is ", missing_proportion,
", but ideally it stays below ", threshold_warning, "."
)
}