prepepi: tools for cleaning and preparing epidemiological data #42

thibautjombart · 2022-09-21T16:34:46Z

prepepi: tools for cleaning epidemiological data

Description

This package would provide tools for cleaning and preparing epidemiological data. It would consist mostly of wrappers around existing tools and would re-implement several features of the old RECON package linelist, which was neither finished nor released.

It would provide the following features:

- date cleaning, including automated detection of date formats, based on lubridate; this would reimplement features from linelist::clean_dates
- cleaning of labels and character strings using janitor, replicating features from linelist::clean_variable_labels and linelist::clean_variable_names; this should include the option to replace accents and non-ASCII characters
- dictionary-based data cleaning using matchmaker, replicating features from linelist::clean_variables
- anonymising data using hashing algorithms, extending and polishing epitrix::hash_names (which would then be deprecated)
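
As a rough illustration of what the first three features would wrap, here is a minimal sketch that calls janitor, lubridate and matchmaker directly. The data and the dictionary are made up for the example, and this is not the proposed prepepi API; it only shows the existing calls that such wrappers could build on.

```r
library(janitor)    # clean_names()
library(lubridate)  # parse_date_time()
library(matchmaker) # match_df()

# Fictitious dirty data (made up for illustration)
raw_data <- data.frame(
  "Date of Onset" = c("2022-09-21", "21/09/2022", "Sep 21 2022"),
  "Gender" = c("m", "F", "female"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Fictitious cleaning dictionary in long format: raw value, clean value,
# and the name of the variable the rule applies to
dict <- data.frame(
  options = c("m", "F", "female"),
  values = c("male", "female", "female"),
  grp = "gender",
  stringsAsFactors = FALSE
)

# 1. Standardise column names: "Date of Onset" becomes "date_of_onset"
clean <- janitor::clean_names(raw_data)

# 2. Try several candidate day/month/year orders on the date column
clean$date_of_onset <- as.Date(lubridate::parse_date_time(
  clean$date_of_onset,
  orders = c("ymd", "dmy", "mdy")
))

# 3. Recode values using the dictionary
clean <- matchmaker::match_df(
  clean,
  dictionary = dict,
  from = "options", to = "values", by = "grp"
)
```

The list of candidate `orders` is an assumption of this sketch; ambiguous entries such as 01/02/2022 would be read according to whichever order matches first, which is the kind of default a wrapper would need to document.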

Target audience

- typical end-users: anyone having to clean up epidemiological data
- potential contributors: same as the end-users; user feedback is likely to point to common use-cases which may result in new features
- key collaborators: field epidemiologists; people with dirty data!

Interoperability

- inputs: a data.frame (or tibble) of dirty data
- outputs: a data.frame (or tibble) of clean data
- related projects:
  - the old RECON package linelist, which implemented a mixed bag of tools for data storage and cleaning
  - the new linelist on Epiverse-TRACE, which only implements tools for data storage
- this package would provide mostly wrappers; we would need to be crystal-clear about where functionalities come from, and provide appropriate links on the website and in the documentation

Usage

The code below illustrates a typical use of the package, using fictitious code and outputs if needed:

library(tidyverse)
library(prepepi)

# Use case 1: detailed cleaning with options 
raw_data <- rio::import("some_linelist.xlsx") %>%
  tibble()
dict <- rio::import("cleaning_dictionary.xlsx") %>%
  tibble()
clean_data <- raw_data %>%
  clean_variable_names() %>%
  clean_labels(
    lower = TRUE, # set to lower case
    force_ascii = FALSE, # keep non-ascii characters
    dictionary = dict) %>%
  clean_dates()

# Use case 2: a default set of cleaning tools and hashing
clean_data <- raw_data %>%
  clean_data(dictionary = dict) %>%
  anonymise_data(
    variables = c("first_name", "last_name", "dob", "gender")) # produce hash labels using input fields which are then removed
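
To make the hashing step more concrete, here is a minimal sketch of what a function like anonymise_data() might do internally. The helper name `anonymise_sketch`, its `salt` argument and the use of the digest package are assumptions made for illustration; this does not reproduce epitrix::hash_names.

```r
library(digest)  # digest() provides SHA-256 and other hash functions

# Hypothetical helper, not an existing prepepi or epitrix function:
# builds one hashed identifier per row from the identifying columns,
# then drops those columns.
anonymise_sketch <- function(x, variables, salt = "") {
  # Normalise each field (trim, lower case) so trivial spelling
  # differences do not produce different hashes
  fields <- lapply(x[variables], function(v) tolower(trimws(as.character(v))))
  keys <- do.call(paste, c(unname(fields), sep = "_"))
  # Hash each salted key; serialize = FALSE hashes the string itself
  ids <- vapply(
    paste0(salt, keys),
    digest::digest,
    character(1),
    algo = "sha256",
    serialize = FALSE,
    USE.NAMES = FALSE
  )
  out <- x[setdiff(names(x), variables)]
  out$id <- ids
  out
}

# Fictitious data: both rows describe the same person, spelled differently
people <- data.frame(
  first_name = c("Ana", " ana "),
  last_name = c("Silva", "SILVA"),
  outcome = c("recovered", "recovered"),
  stringsAsFactors = FALSE
)
anon <- anonymise_sketch(people, c("first_name", "last_name"), salt = "s3cret")
```

Both rows end up with the same identifier because the fields are normalised before hashing. A plain salted SHA-256 is used here for brevity; names have little entropy, so a real implementation would likely prefer a slower, purpose-built hashing function.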

thibautjombart · 2022-09-21T16:46:31Z

Cons: a lot of this can be done by using the packages referred to above, and we could merely provide documentation for these approaches
Pros: it would provide all these tools in one place, with sensible defaults which may become standards; there have been a few requests (DM or RECON forum) re the original linelist package, which seemed well-liked, and is currently used in the R epi handbook

I would really like to get a feel for pros/cons from the community. Thumbs up or down or comment welcome!

thibautjombart · 2022-09-21T16:48:36Z

Note: I have thought of calling this 'epiclean' or 'cleanepi', but I think it is useful to have other basic tools in there for preparing the data (e.g. hashing algorithms), which makes it not just about cleaning.

nsbatra · 2022-09-26T12:52:55Z

Hey Thibaut,
Thanks for bringing this up. I think this question reflects the constant tension between offering a wrapper vs. just sending folks to the underlying packages.

In our courses so far, we have had no problems sending students directly to {janitor} for column name cleaning (and they are using janitor anyway for tabyl() as their go-to function for quick tabulations). We have them use lubridate for many purposes, and for really messy dates we refer them to {parsedate}. Similar story for {matchmaker}. I don't have much experience with the hashing, so I won't comment on that.

In the Epi R Handbook we use janitor::clean_names(). Given the above, we're thinking of shifting the Epi R Handbook's messy-date handling to parsedate, and the dictionary-based cleaning to matchmaker, but haven't made the move yet.

I think your pros and cons are well laid out. Right now I'd be on the side of just having good public health R user documentation/help for using these underlying packages. But if you can compile a large enough set of gaps not met by these packages, and since now there are more resources available to upkeep new packages... perhaps...

happy to talk more

thibautjombart · 2022-09-26T13:06:42Z

Hi Neale,
great, thanks for your feedback. It echoes what I suspected, but it is good to hear it first-hand from someone extensively involved in training field epis. On the hashing / data anonymisation: this could merely be a small self-contained package.
