Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add messy_linelist() function #187

Open
wants to merge 35 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
7f689f2
add messy function
joshwlambert Jan 31, 2025
51f3373
update NAMESPACE with messy function
joshwlambert Jan 31, 2025
68005d6
add unit test for messy
joshwlambert Jan 31, 2025
1c2ac7e
add .spelling_mistake function
joshwlambert Jan 31, 2025
38fb7b8
add spelling mistakes to messy function
joshwlambert Jan 31, 2025
6d63de5
add unit tests for messy() spelling mistakes
joshwlambert Jan 31, 2025
bad14ea
add inconsistent_sex option to messy()
joshwlambert Jan 31, 2025
404ad59
add unit tests for messy() inconsistent_sex feature
joshwlambert Jan 31, 2025
a9c8a49
add sex_as_numeric to messy()
joshwlambert Feb 5, 2025
e69fd83
add unit tests for messy() sex_as_numeric = TRUE
joshwlambert Feb 5, 2025
56ed45b
add numeric_as_char to messy()
joshwlambert Feb 5, 2025
c2156e2
add units tests for numeric_as_char in messy()
joshwlambert Feb 5, 2025
a10d4bd
add date_as_char to messy()
joshwlambert Feb 5, 2025
97fa3dd
add unit tests for date_as_char in messy()
joshwlambert Feb 5, 2025
2215677
fix unit test expectations for messy()
joshwlambert Feb 5, 2025
87f010b
add inconsistent_dates to messy()
joshwlambert Feb 5, 2025
256e692
add unit tests for inconsistent_date in messy()
joshwlambert Feb 5, 2025
0df300d
add @description doc to messy()
joshwlambert Feb 5, 2025
5e1d07b
linting messy()
joshwlambert Feb 5, 2025
ec657d9
fix unit test for inconsistent_dates in messy()
joshwlambert Feb 5, 2025
2ed8609
fix object name bug in messy(), relates #183
joshwlambert Feb 10, 2025
4370eef
add int_as_word to messy(), relates #183
joshwlambert Feb 10, 2025
9df1579
add {english} to Imports in DESCRIPTION
joshwlambert Feb 10, 2025
0534bcd
add unit test for int_as_words in messy()
joshwlambert Feb 10, 2025
414dc45
correct error message for incorrect arg in messy()
joshwlambert Feb 10, 2025
b0783c6
add unit test for incorrect arg passed to messy()
joshwlambert Feb 10, 2025
6a5d94f
rename messy() to messy_linelist(), relates #183
joshwlambert Feb 10, 2025
c58dac6
call .check_linelist in messy_linelist()
joshwlambert Feb 10, 2025
ad39526
add unit test for incorrect linelist in messy_linelist()
joshwlambert Feb 10, 2025
447218a
add bullet point to design principles vignette on naming of exported …
joshwlambert Feb 10, 2025
0e5b0f7
add {english} to hard dependencies in design-principles vignette
joshwlambert Feb 10, 2025
10956a8
add messy_linelist to _pkgdown.yml
joshwlambert Feb 10, 2025
687e606
add {cleanepi} to Complinmentary R packages in README
joshwlambert Feb 10, 2025
9fc6f8f
add {messy} to Related projects in README
joshwlambert Feb 10, 2025
11ceef5
Automatic readme update
actions-user Feb 10, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ Depends:
R (>= 4.1.0)
Imports:
checkmate,
english,
epiparameter (>= 0.4.0),
randomNames,
rlang,
Expand Down
1 change: 1 addition & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# Generated by roxygen2: do not edit by hand

export(create_config)
export(messy_linelist)
export(sim_contacts)
export(sim_linelist)
export(sim_outbreak)
Expand Down
224 changes: 224 additions & 0 deletions R/messy_linelist.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,224 @@
#' Create messy line list data
#'
#' @description
#' Take line list output from [sim_linelist()] and replace elements of
#' the `<data.frame>` with missing values (e.g. `NA`), introduce spelling
#' mistakes and inconsistencies, as well as coerce date types.
#'
#' @param linelist Line list `<data.frame>` output from [sim_linelist()].
#' @inheritParams create_config
#'
#' @details
#' By default `messy_linelist()`:
#'
#' * Makes 10% of values missing, i.e. converts to `NA`.
#' * Introduces spelling mistakes in 10% of `character` columns.
#' * Introduce inconsistency in the reporting of `$sex`.
#' * Converts `numeric` columns (`double` & `integer`) to `character`.
#' * Converts `Date` columns to `character`.
#' * Converts `integer` columns to (English) words.
#'
#' To change the defaults of `messy_linelist()` arguments can be supplied
#' to `...`.
#'
#' Accepted arguments and their defaults are:
#'
#' * `prop_missing = 0.1`
#' * `missing_value = NA`
#' * `prop_spelling_mistakes = 0.1`
#' * `inconsistent_sex = TRUE`
#' * `sex_as_numeric = FALSE`
#' * `numeric_as_char = TRUE`
#' * `date_as_char = TRUE`
#' * `inconsistent_dates = FALSE`
#' * `int_as_word = TRUE`
#'
#' When setting `sex_as_numeric` to `TRUE`, male is set to `0` and female
#' to `1`. Only one of `inconsistent_sex` or `sex_as_numeric` can be `TRUE`,
#' otherwise the function will error.
#'
#' If `numeric_as_char = TRUE` and `sex_as_numeric = TRUE` then the sex encoded
#' as 0 or 1 is converted to `character`. If `prop_spelling_mistake` > 0 and
#' `numeric_as_char = TRUE` the columns that are converted from `numeric` to
#' `character` do not have spelling mistakes introduced, because they are
#' numeric characters stored as character strings. If
#' `prop_spelling_mistake` > 0 and `date_as_char = TRUE` spelling mistakes are
#' not introduced into dates.
#'
#' The `Date` columns can be converted into an inconsistent format by
#' setting `inconsistent_dates = TRUE` and it requires `date_as_char = TRUE`,
#' if the latter is `FALSE` the function will error.
#'
#' If `numeric_as_char = FALSE` and `int_as_word = TRUE` then the integer
#' columns are converted to `character` string words but the other `numeric`
#' columns are not coerced. Spelling mistakes are not introduced into integers
#' converted to words when `prop_spelling_mistakes` > 0 and
#' `int_as_word = TRUE`.
#'
#' @return A messy line list `<data.frame>`.
#' @export
#'
#' @examples
#' linelist <- sim_linelist()
#' messy_linelist <- messy_linelist(linelist)
#'
#' # increasing proportion of missingness to 30% with a missing value of -99
#' messy_linelist <- messy_linelist(
#' linelist,
#' prop_missing = 0.3,
#' missing_value = -99
#' )
#'
#' # increasing proportion of spelling mistakes to 50%
#' messy_linelist <- messy_linelist(linelist, prop_spelling_mistakes = 0.5)
#'
#' # encode `$sex` as `numeric`
#' messy_linelist <- messy_linelist(
#' linelist,
#' sex_as_numeric = TRUE,
#' inconsistent_sex = FALSE
#' )
#'
#' # inconsistently formatted dates
#' messy_linelist <- messy_linelist(linelist, inconsistent_dates = TRUE)
messy_linelist <- function(linelist, ...) {
.check_linelist(linelist)

args <- list(
prop_missing = 0.1,
missing_value = NA,
prop_spelling_mistakes = 0.1,
inconsistent_sex = TRUE,
sex_as_numeric = FALSE,
numeric_as_char = TRUE,
date_as_char = TRUE,
inconsistent_dates = FALSE,
int_as_word = TRUE
)

# capture dynamic dots
dots <- rlang::dots_list(..., .ignore_empty = "none", .homonyms = "error")
dots_names <- names(dots)

# check arguments in dots match arg list
stopifnot(
"Incorrect argument names supplied to `messy_linelist()`" =
all(dots_names %in% names(args))
)

# replace default args if in dots
args <- utils::modifyList(args, dots)

# check args list after any user changes
checkmate::assert_number(args$prop_spelling_mistakes, lower = 0, upper = 1)
checkmate::assert_logical(args$inconsistent_sex, any.missing = FALSE, len = 1)
checkmate::assert_logical(args$sex_as_numeric, any.missing = FALSE, len = 1)
checkmate::assert_logical(args$numeric_as_char, any.missing = FALSE, len = 1)
checkmate::assert_logical(args$date_as_char, any.missing = FALSE, len = 1)
checkmate::assert_logical(
args$inconsistent_dates, any.missing = FALSE, len = 1
)
checkmate::assert_logical(args$int_as_word, any.missing = FALSE, len = 1)
stopifnot(
"Only one of `inconsistent_sex` or `sex_as_numeric` can be `TRUE`." =
!(args$inconsistent_sex && args$sex_as_numeric),
"`date_as_char` must be TRUE when `inconsistent_dates = TRUE`." =
!(args$inconsistent_dates && isFALSE(args$date_as_char))
)

if (args$inconsistent_sex) {
linelist$sex[linelist$sex == "m"] <- sample(
x = c("m", "M", "male", "Male"),
size = sum(linelist$sex == "m"),
replace = TRUE
)
linelist$sex[linelist$sex == "f"] <- sample(
x = c("f", "F", "female", "Female"),
size = sum(linelist$sex == "f"),
replace = TRUE
)
} else if (args$sex_as_numeric) {
# vectorised switch
linelist$sex <- vapply(
linelist$sex, switch, m = 0L, f = 1L, FUN.VALUE = numeric(1)
)
}

if (args$prop_spelling_mistakes > 0) {
# only apply spelling mistakes on character columns
char_cols <- names(linelist)[(vapply(linelist, is.character, logical(1)))]
linelist[char_cols] <- lapply(linelist[char_cols], function(col) {
misspell <- runif(length(col)) < args$prop_spelling_mistakes
ifelse(
test = misspell,
yes = vapply(col, .spelling_mistake, character(1)),
no = col
)
})
}

# call after prop_spelling_mistakes to not create mistakes to numeric chars
if (args$numeric_as_char) {
numeric_col <- vapply(linelist, is.numeric, FUN.VALUE = logical(1))
linelist[, numeric_col] <- vapply(
linelist[, numeric_col],
as.character,
FUN.VALUE = character(nrow(linelist))
)
}

# call after prop_spelling_mistakes to not create mistakes to date chars
if (args$date_as_char) {
date_col <- vapply(
linelist, inherits, FUN.VALUE = logical(1), what = "Date"
)
linelist[, date_col] <- vapply(
linelist[, date_col],
as.character,
FUN.VALUE = character(nrow(linelist))
)
}

if (args$inconsistent_dates) {
date_col <- startsWith(colnames(linelist), "date_")
date_fmt <- sample(
c("%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y", "%d/%m/%Y", "%d %B %Y", "%d %b %Y"), # nolint nonportable_path_linter
size = nrow(linelist),
replace = TRUE
)
for (col in colnames(linelist[, date_col])) {
# format arg is is vectorised
linelist[, col] <- strftime(linelist[, col], format = date_fmt)
}
}

if (args$int_as_word) {
int_col <- vapply(linelist, is.integer, FUN.VALUE = logical(1))
linelist[, int_col] <- vapply(
linelist[, int_col],
english::words,
FUN.VALUE = character(nrow(linelist))
)
}

# random missingness introduced across <data.frame>
num_missing <- round(prod(dim(linelist)) * args$prop_missing)

# matrix of line list dimensions to sample unique index pairs
ll_dim_idx <- expand.grid(
row = seq_len(nrow(linelist)),
col = seq_len(ncol(linelist))
)

# sample line list index pairs
ll_idx <- ll_dim_idx[sample(nrow(ll_dim_idx), num_missing), ]

# set sampled index pairs to missing
for (i in seq_len(num_missing)) {
# check user-specified missing_value causing unwanted type coercion
linelist[ll_idx[i, 1], ll_idx[i, 2]] <- args$missing_value
}

# return line list
linelist
}
31 changes: 31 additions & 0 deletions R/utils.R
Original file line number Diff line number Diff line change
Expand Up @@ -268,3 +268,34 @@ as_function <- function(x) {
}
contact_infectious_period
}

#' Introduce spelling mistake in `character` string.
#'
#' @description
#' `.spelling_mistake()` only introduces spelling mistakes on strings with
#' more than one character.
#'
#' @details
#' A single letter is replaced at random, it is possible the letter is replaced
#' with the same letter not resulting in a spelling mistake.
#'
#' @param char A single `character` string.
#'
#' @return A single `character` string.
#' @keywords internal
#' @noRd
.spelling_mistake <- function(char) {
checkmate::assert_string(char)
if (nchar(char) < 2) return(char)
chars <- strsplit(char, "", fixed = TRUE)[[1]]
n_chars <- length(chars)
letter_idx <- sample(seq_len(n_chars), size = 1)
chars[letter_idx] <- ifelse(
test = letter_idx == 1,
yes = sample(LETTERS, 1),
no = sample(letters, 1)
)
char <- paste(chars, collapse = "")
# return character string
char
}
3 changes: 3 additions & 0 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -204,6 +204,7 @@ citation("simulist")
:package: :left_right_arrow: :package: [{epiparameter}](https://epiverse-trace.github.io/epiparameter/) <br>
:package: :left_right_arrow: :package: [{epicontacts}](https://www.repidemicsconsortium.org/epicontacts/) <br>
:package: :left_right_arrow: :package: [{incidence2}](https://www.reconverse.org/incidence2/) <br>
:package: :left_right_arrow: :package: [{cleanepi}](https://epiverse-trace.github.io/cleanepi/) <br>

## Related projects

Expand Down Expand Up @@ -253,3 +254,5 @@ Some packages are related to {simulist} but do not simulate line list data. Thes
- [`{epichains}`](https://github.com/epiverse-trace/epichains) an R package with functionality to simulate transmission chains using a branching process model.

The {outbreaks} package is useful if data from a past outbreak data or generic line list data is required. The {ringbp} and {epichains} packages can be used to generate case data over time which can then be converted into a line list with some manual post-processing.

Another package for creating messy data is the [{messy}](https://CRAN.R-project.org/package=messy) package. This can be used, either independently or in combination with `messy_linelist()`, to create messy line list and contacts data.
9 changes: 8 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -387,7 +387,9 @@ citation("simulist")
:package: :left_right_arrow: :package:
[{epicontacts}](https://www.repidemicsconsortium.org/epicontacts/) <br>
:package: :left_right_arrow: :package:
[{incidence2}](https://www.reconverse.org/incidence2/) <br>
[{incidence2}](https://www.reconverse.org/incidence2/) <br> :package:
:left_right_arrow: :package:
[{cleanepi}](https://epiverse-trace.github.io/cleanepi/) <br>

## Related projects

Expand Down Expand Up @@ -473,6 +475,11 @@ generic line list data is required. The {ringbp} and {epichains}
packages can be used to generate case data over time which can then be
converted into a line list with some manual post-processing.

Another package for creating messy data is the
[{messy}](https://CRAN.R-project.org/package=messy) package. This can be
used, either independently or in combination with `messy_linelist()`, to
create messy line list and contacts data.

[^1]: In this context *Parameterised with epi distributions* means that
the simulation uses epidemiological distributions (e.g. serial
interval, infectious period) to parameterise the model and the
Expand Down
4 changes: 4 additions & 0 deletions _pkgdown.yml
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,10 @@ reference:
contents:
- truncation

- title: Messy line list
contents:
- messy_linelist

- title: Helper functions
contents:
- create_config
Expand Down
Loading
Loading