---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r readmesetup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# **linelist**: Tagging and Validating Epidemiological Data <img src="man/figures/logo.svg" align="right" width="120" />
<!-- badges: start -->
[![Digital Public Good](https://raw.githubusercontent.com/epiverse-trace/linelist/main/man/figures/dpg_badge.png)](https://digitalpublicgoods.net/registry/linelist.html)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/license/mit)
[![cran-check](https://badges.cranchecks.info/summary/linelist.svg)](https://cran.r-project.org/web/checks/check_results_linelist.html)
[![R-CMD-check](https://github.com/epiverse-trace/linelist/workflows/R-CMD-check/badge.svg)](https://github.com/epiverse-trace/linelist/actions)
[![codecov](https://codecov.io/gh/epiverse-trace/linelist/branch/main/graph/badge.svg?token=JGTCEY0W02)](https://app.codecov.io/gh/epiverse-trace/linelist)
[![lifecycle-maturing](https://raw.githubusercontent.com/reconverse/reconverse.github.io/master/images/badge-maturing.svg)](https://www.reconverse.org/lifecycle.html#maturing)
[![month-download](https://cranlogs.r-pkg.org/badges/linelist)](https://cran.r-project.org/package=linelist)
[![total-download](https://cranlogs.r-pkg.org/badges/grand-total/linelist)](https://cran.r-project.org/package=linelist)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6532786.svg)](https://doi.org/10.5281/zenodo.6532786)
<!-- badges: end -->
*linelist* provides a safe entry point to the *Epiverse* software ecosystem,
adding a foundational layer through *tagging*, *validation*, and *safeguarding*
epidemiological data, to help make data pipelines more straightforward and
robust.
## Installation
### Stable version
Our stable versions are released on CRAN, and can be installed using:
```{r, eval=FALSE}
install.packages("linelist")
```
::: {.pkgdown-devel}
### Development version
The development version of linelist can be installed from
[GitHub](https://github.com/) with:
```{r, eval=FALSE}
# install pak only if it is not already available
if (!requireNamespace("pak", quietly = TRUE)) {
  install.packages("pak")
}
pak::pak("epiverse-trace/linelist")
```
:::
## Usage
```{r}
#| fig.alt: "Graphical summary of the linelist R package, with emphasis on these 4 key features: 1. Tag key epi variables, 2. Validate tagged data, 3. Safeguards vs accidental loss / alteration, 4. Robust data for stronger pipelines"
#| out.width: "60%"
knitr::include_graphics("man/figures/linelist_infographics.png")
```
linelist works by tagging key epidemiological data in a `data.frame` or a
`tibble` to facilitate and strengthen data pipelines. The resulting object is a
`linelist` object, which extends `data.frame` (or `tibble`) by providing three
types of features:
1. a **tagging system** to identify key data, enabling access to these data using
their tags rather than actual names, which may change over time and across
datasets
2. **validation** of the tagged variables (making sure they are present and of the
right type/class)
3. **safeguards** against accidental losses of tagged variables in common data
handling operations
The short example below illustrates these different features. See the
[Documentation](#documentation) section for more in-depth examples and details
about `linelist` objects.
```{r}
# load packages and a dataset for the example
# -------------------------------------------
library(linelist)
library(dplyr)
dataset <- outbreaks::mers_korea_2015$linelist
head(dataset)
# check known tagged variables
# ----------------------------
tags_names()
# build a linelist
# ----------------
x <- dataset %>%
  tibble() %>%
  make_linelist(
    date_onset = "dt_onset", # date of onset
    date_reporting = "dt_report", # date of reporting
    occupation = "age" # mistake
  )
x
tags(x) # check available tags
```
`validate_linelist()` will error if one of your tagged columns does not have
the correct type:
```{r, error = TRUE}
# validation of tagged variables
# ------------------------------
## (this flags a likely mistake: occupation should not be an integer)
validate_linelist(x)
```
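
To see why a check like the one above fails, it can help to inspect which classes each tag accepts. The sketch below is illustrative only: it assumes `tags_types()` returns a named list of accepted classes per tag, and it uses a small made-up data frame (`toy`, hypothetical) instead of the MERS dataset. Wrapping the call in `tryCatch()` is one possible way to record the validation message in a script rather than halting.

```r
library(linelist)

# classes accepted for the 'occupation' tag
tags_types()$occupation

# hypothetical toy data: 'age' is an integer column
toy <- data.frame(
  case_id = c("a", "b", "c"),
  age = c(34L, 51L, 28L)
)

# tagging 'age' as occupation is accepted when the linelist is created ...
x_toy <- make_linelist(toy, id = "case_id", occupation = "age")

# ... but validation reports the type mismatch; capture the message
msg <- tryCatch(
  {
    validate_linelist(x_toy)
    "all tagged variables are valid"
  },
  error = function(e) conditionMessage(e)
)
msg
```
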
```{r}
# change tags: fix mistakes, add new ones
# ---------------------------------------
x <- x %>%
  set_tags(
    occupation = NULL, # tag removal
    gender = "sex", # new tag
    outcome = "outcome"
  )
# safeguards against actions losing tags
# --------------------------------------
## attempting to remove geographical info but removing dates by mistake
x_no_geo <- x %>%
  select(-(5:8))
```
For stronger pipelines, you can even trigger errors upon loss:
```{r error = TRUE}
lost_tags_action("error")
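## this now errors: the selection drops a tagged variable
x_no_geo <- x %>%
  select(-(5:8))
## this works: only untagged columns are dropped
x_no_geo <- x %>%
  select(-(5:7))
## to revert to default behaviour (warning upon loss)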
lost_tags_action()
```
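
Because `lost_tags_action()` changes a session-wide setting, a pipeline may prefer to switch to errors only around a specific step. The helper below, `with_strict_tags()`, is hypothetical and not part of linelist. It is a minimal sketch that assumes `get_lost_tags_action()` returns the current setting and that `lost_tags_action()` accepts `quiet = TRUE`; the toy data frame is made up for illustration.

```r
library(linelist)

# hypothetical helper: evaluate `code` with lost tags treated as errors,
# then restore whichever action was set before the call
with_strict_tags <- function(code) {
  old_action <- get_lost_tags_action()
  on.exit(lost_tags_action(old_action, quiet = TRUE), add = TRUE)
  lost_tags_action("error", quiet = TRUE)
  force(code) # `code` is lazily evaluated here, under the strict setting
}

toy <- data.frame(
  case_id = 1:3,
  dt_onset = as.Date("2015-05-11") + 0:2,
  sex = c("M", "F", "F")
)
x_toy <- make_linelist(toy, date_onset = "dt_onset")

# dropping the tagged 'dt_onset' column errors inside the helper
res <- tryCatch(
  with_strict_tags(x_toy[, c("case_id", "sex")]),
  error = function(e) "tag loss caught"
)
res

# the previous setting is back in place afterwards
get_lost_tags_action()
```

Restoring the setting in `on.exit()` means the earlier behaviour is reinstated even when the wrapped step fails.
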
Alternatively, content can be accessed by tags:
```{r}
x_no_geo %>%
  select(has_tag(c("date_onset", "outcome")))
x_no_geo %>%
  tags_df()
```
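
Access by tag is what lets the same downstream code run on datasets whose column names differ. The sketch below uses two made-up data frames (`site_a` and `site_b`, both hypothetical) and a hypothetical helper `first_onset()`. It assumes that `tags_df()` returns the tagged columns renamed after their tags, as in the output above.

```r
library(linelist)

# two hypothetical sources recording the same information under different names
site_a <- data.frame(
  dt_onset = as.Date("2015-05-11") + 0:2,
  sex = c("M", "F", "F")
)
site_b <- data.frame(
  onset_date = as.Date("2015-06-01") + 0:1,
  gender_reported = c("F", "M")
)

# the tags record which column plays which role in each source
x_a <- make_linelist(site_a, date_onset = "dt_onset", gender = "sex")
x_b <- make_linelist(site_b, date_onset = "onset_date", gender = "gender_reported")

# hypothetical downstream step: it refers to tag names only
first_onset <- function(x) {
  min(tags_df(x)$date_onset)
}

first_onset(x_a)
first_onset(x_b)
```
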
linelist can also be connected to the incidence2 package for pipelines focused
on aggregated count data:
```{r, fig.width=8, fig.height=6, fig.alt="Epicurves (daily incidence) by sex and outcome via the incidence2 R package."}
library(incidence2)
x_no_geo %>%
  tags_df() %>%
  incidence("date_onset", groups = c("gender", "outcome")) %>%
  plot(
    fill = "outcome",
    angle = 45,
    nrow = 2,
    border_colour = "white",
    legend = "bottom"
  )
```
## Documentation
More detailed documentation can be found at:
https://epiverse-trace.github.io/linelist/
In particular:
* [A general introduction to linelist](https://epiverse-trace.github.io/linelist/articles/linelist.html)
* [The reference manual](https://epiverse-trace.github.io/linelist/reference/index.html)
## Getting help
To ask questions or give feedback, please use the GitHub
[issues](https://github.com/epiverse-trace/linelist/issues) system.
## Data privacy
Case line lists may contain personally identifiable information (PII). While
linelist provides a way to store this data in R, it does not currently provide
tools for data anonymization. The user is responsible for respecting individual
privacy and ensuring PII is handled with the required level of confidentiality,
in compliance with applicable laws and regulations for storing and sharing PII.
Note that PII is rarely needed for common analytics tasks, so that in many
instances it may be advisable to remove PII from the data before sharing them
with analytics teams.
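
As a minimal sketch of removing PII before sharing, the base R example below drops identifier columns from a made-up data frame. The column names and the `pii_columns` vector are assumptions chosen for illustration; which columns count as PII depends on the dataset and the applicable regulations, and dropping columns alone may not be sufficient to anonymize data.

```r
# hypothetical line list containing direct identifiers
cases <- data.frame(
  case_id = c("c01", "c02", "c03"),
  full_name = c("A. Example", "B. Example", "C. Example"),
  phone = c("000-0001", "000-0002", "000-0003"),
  date_onset = as.Date("2015-05-11") + 0:2,
  outcome = c("Alive", "Dead", "Alive")
)

# columns treated as PII here (an assumption; adapt to each dataset)
pii_columns <- c("full_name", "phone")

# keep every column that is not listed as PII
cases_shared <- cases[, !(names(cases) %in% pii_columns), drop = FALSE]
names(cases_shared)
```
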
## Development
### Lifecycle
This package is currently *maturing*, as defined by the [RECON software
lifecycle](https://www.reconverse.org/lifecycle.html). This means that essential
features and mechanisms are present and stable, but minor breaking changes or
function renames may still occur sporadically.
### Contributions
Contributions are welcome via [pull requests](https://github.com/epiverse-trace/linelist/pulls).
### Code of Conduct
Please note that the linelist project is released with a
[Code of Conduct](https://github.com/epiverse-trace/.github/blob/main/CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms.
### Notes
This package is a reboot of the RECON package
[linelist](https://github.com/reconhub/linelist). Unlike its predecessor, the
new package focuses on the implementation of a `linelist` class. The data
cleaning features of the original package will eventually be re-implemented for
`linelist` objects, albeit likely in a separate package.