---
title: "Synthetic Data"
date: today
format:
  html:
    toc: true
    embed-resources: true
    code-line-numbers: true
editor_options:
  chunk_output_type: console
execute:
  warning: false
  message: false
bibliography: references.bib
---
```{=html}
<style>
@import url('https://fonts.googleapis.com/css?family=Lato&display=swap');
</style>
```
```{r}
#| label: setup2
#| echo: false
library(tidyverse)
library(palmerpenguins)
library(kableExtra)
library(gt)
library(urbnthemes)
set_urbn_defaults(style = "print")
options(scipen = 999)
source(here::here("R", "create_table.R"))
```
```{r}
#| echo: false
exercise_number <- 1
```
::: {.callout-tip}
## Synthetic data
**Synthetic data** consists of pseudo or “fake” records that can be statistically representative of the confidential data.
:::
- The goal of most syntheses is to closely mimic the underlying distribution and statistical properties of the real data to preserve data utility while minimizing disclosure risks.
- Synthesized values also limit an intruder's confidence, because the intruder cannot confirm that any given synthetic value exists in the confidential dataset.
- Synthetic data may be used as a “training dataset” to develop programs to run on confidential data via a validation server.
::: {.callout-tip}
## Partially synthetic
**Partially synthetic** data only synthesizes some observations or variables in the released data (generally those most sensitive to disclosure). In partially synthetic data, there remains a one-to-one mapping between confidential records and synthetic records.
:::
In @fig-partial, we see an example of what a partially synthesized version of the above confidential data could look like.
![Partially synthetic data](www/images/partially-synthetic-data.png){#fig-partial width="550"}
::: {.callout-tip}
## Fully synthetic
**Fully synthetic** data replaces every value in the dataset with an imputed value. Fully synthetic records no longer map directly onto the confidential records, but they remain statistically representative. Because fully synthetic data contain no actual observations, they protect against both attribute and identity disclosure.
:::
In @fig-fully, we see an example of what a fully synthesized version of the confidential data could look like.
![Fully synthetic data](www/images/fully-synthetic-data.png){#fig-fully width="550"}
## `r paste("Exercise", exercise_number)`
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
Consider the penguins data from earlier.
```{r}
#| echo: false
set.seed(125)
ex_data <- penguins |>
select(species, bill_length_mm, sex) |>
slice_sample(n = 5)
ex_data |>
create_table()
```
Let's say that researchers decide that the `sex` of the penguins in the data is not confidential, but the `species` and `bill_length_mm` are. So, they develop regression models that predict `species` conditional on `sex` and predict `bill_length_mm` conditional on `species` and `sex`. They then use those models to predict species and bill lengths for each row in the data and release the result publicly.
::: {.panel-tabset}
### <font color="#55b748">Question</font>
*What specific Statistical Disclosure Control method are these researchers using?*
### <font color="#55b748">Solution</font>
*What specific Statistical Disclosure Control method are these researchers using?*
They are using partially synthetic data.
:::
## `r paste("Exercise", exercise_number)`
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
::: {.panel-tabset}
### <font color="#55b748">Question</font>
*A researcher has confidential data on a population. To protect the privacy of respondents, the researcher releases a synthetic version of the data. A data attacker then runs a record linkage attack against the synthetic data and is able to accurately identify 5 individuals in the data. Based on this information, can you tell whether the researcher released fully or partially synthetic data? Why or why not?*
### <font color="#55b748">Answer</font>
*A researcher has confidential data on a population. To protect the privacy of respondents, the researcher releases a synthetic version of the data. A data attacker then runs a record linkage attack against the synthetic data and is able to accurately identify 5 individuals in the data. Based on this information, can you tell whether the researcher released fully or partially synthetic data? Why or why not?*
Yes. Record linkage attacks require a one-to-one mapping between confidential and synthetic records, which only partially synthetic data retain, so the researcher must have released partially synthetic data. Other types of disclosure risk still apply to fully synthetic data.
:::
## Synthetic Data <-> Imputation Connection
- Multiple imputation was originally developed to address non-response problems in surveys [@rubin1977formalizing].
- Statisticians created new observations or values to replace the missing data by developing a model based on other available respondent information.
- This process of replacing missing data with substituted values is called **imputation**.
### Imputation Example
```{r setup, echo = FALSE}
# set Urban Institute data visualization styles
#set_urbn_defaults(base_size = 12)
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# Create data of conference attendees, where half are missing age
sample_conf <- tibble(
attendee_number = 1:80,
age = c(round(rnorm(n = 40, mean = 46, sd = 13), 0), rep(NA, 40))
)
```
Imagine you are running a conference with 80 attendees. You are collecting names and ages of all your attendees. Unfortunately, when the conference is over, you realize that only about half of the attendees listed their ages. One common imputation technique is to just replace the missing values with the mean age of those in the data.
@fig-histogram-before shows the distribution of the 40 age observations that are not missing.
```{r, echo = FALSE, fig.height = 3.5}
#| label: fig-histogram-before
# plot attendee ages
ggplot(sample_conf, aes(x = age)) +
geom_histogram(binwidth = 5) +
labs(title = 'Histogram of attendee ages')
# replace NA values with mean age
sample_conf <- sample_conf |>
mutate(
age = if_else(
condition = is.na(age),
true = round(mean(age, na.rm = TRUE), 0),
false = age
)
)
```
@fig-histogram-after shows the histogram after imputation.
```{r, echo = FALSE, fig.height = 3.5}
#| label: fig-histogram-after
# replot the histogram
ggplot(sample_conf, aes(x = age)) +
geom_histogram(binwidth = 5) +
labs(title = 'Histogram of attendee ages (with missing values imputed)')
```
- Using the mean to impute the missing ages removes useful variation and conceals information from the "tails" of the distribution.
- Simply put, we used a straightforward model (replace the data with the mean) and sampled from that model to fill in the missing values.
- When creating synthetic data, this process is repeated for an entire variable, or set of variables.
- In a sense, the entire column is treated as missing!
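To make that concrete, here is a minimal sketch in the spirit of the conference example: fit a simple "model" (a normal distribution matched to the observed ages) and draw a synthetic value for every row, not just the missing ones. The object names follow the chunk above, and the sketch assumes access to the pre-imputation ages.

```{r eval = FALSE}
# sketch: treat the entire age column as missing and synthesize it
set.seed(20220301)

# "model" step: summarize the observed (pre-imputation) ages
age_mean <- mean(sample_conf$age, na.rm = TRUE)
age_sd <- sd(sample_conf$age, na.rm = TRUE)

# "sampling" step: draw a synthetic age for every attendee
sample_conf_synthetic <- sample_conf |>
  mutate(age = round(rnorm(n = n(), mean = age_mean, sd = age_sd), 0))
```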
## Sequential Synthesis
A more advanced implementation of synthetic data generation estimates a model for each variable, using previously synthesized variables as predictors. This iterative process is called **sequential synthesis** or **fully conditional specification (FCS)**. It allows us to model multivariate relationships (or joint distributions) without great computational expense.
The process described above may be easier to understand with the following table:
```{r, echo = FALSE}
table <- tribble(
  ~Step, ~Outcome, ~`Modelled with`, ~`Predicted with`,
  "1", "Sex", NA, "Random sampling with replacement",
  "2", "Age", "Sex", "Sampled Sex",
  "3", "Social Security Benefits", "Sex, Age", "Sampled Sex, Sampled Age",
  "...", "...", "...", "..."
)
table |>
create_table()
```
- We can select the synthesis order based on the priority of the variables or the relationships between them.
- **Usually**, the earlier in the order a variable is synthesized, the better its original information is preserved in the synthetic data.
- @bowen2021differentially proposed a method that ranks variables by either practical or statistical utility and sequentially synthesizes the data in that order.
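The three steps in the table can be sketched in code. This is an illustrative sketch only: `conf_data` and its columns (`sex`, `age`, `ssb`) are hypothetical names, and a real synthesis would use more carefully chosen models.

```{r eval = FALSE}
# sketch of sequential synthesis; conf_data, sex, age, ssb are hypothetical
set.seed(20220301)
n <- nrow(conf_data)

# step 1: sample sex with replacement from its observed values
synth <- tibble(sex = sample(conf_data$sex, size = n, replace = TRUE))

# step 2: model age given sex on the confidential data,
# then predict using the sampled sex values
age_lm <- lm(age ~ sex, data = conf_data)
synth$age <- rnorm(n, mean = predict(age_lm, newdata = synth), sd = sigma(age_lm))

# step 3: model benefits given sex and age,
# then predict using the sampled sex and age values
ssb_lm <- lm(ssb ~ sex + age, data = conf_data)
synth$ssb <- rnorm(n, mean = predict(ssb_lm, newdata = synth), sd = sigma(ssb_lm))
```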
## Parametric vs. Nonparametric Data Generation Process
**Parametric data synthesis** is the process of data generation based on a parametric distribution or generative model.
- Parametric models assume a finite number of parameters that capture the complexity of the data.
- They are generally less flexible, but more interpretable than nonparametric models.
- Examples: regression to assign an age variable, sampling from a probability distribution, Bayesian models, or copula-based models.
**Nonparametric data synthesis** is the process of data generation that is *not* based on assumptions about an underlying distribution or model.
- Often, nonparametric methods use frequency proportions or marginal probabilities as weights for some type of sampling scheme.
- They are generally more flexible, but less interpretable than parametric models.
- Examples: assigning gender based on underlying proportions, CART (Classification and Regression Trees) models, RNN models, etc.
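As one nonparametric example, the `synthpop` package can fit CART models to each variable sequentially. A minimal sketch, assuming a confidential data frame `conf_data` (a hypothetical name):

```{r eval = FALSE}
library(synthpop)

# CART-based sequential synthesis of every variable in conf_data
synth_object <- syn(conf_data, method = "cart", seed = 20220301)

# the synthetic data frame lives in the $syn element
synth_data <- synth_object$syn
```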
**Important:** Synthetic data are only as good as the models used for imputation!
## Implicates
- Researchers can create any number of versions of a partially synthetic or fully synthetic dataset. Each version of the dataset is called an **implicate**. Implicates are also referred to as replicates or simply "synthetic datasets."
- For partially synthetic data, non-synthesized variables are the same across each version of the dataset.
- Multiple implicates are useful for understanding the uncertainty added by imputation and are required for calculating valid standard errors.
- More than one implicate can be released for public use; each new release, however, increases disclosure risk (but allows for more complete analysis and better inferences, provided users use the correct combining rules).
- Implicates can also be analyzed internally to find which version(s) of the dataset provide the most utility in terms of data quality.
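As a sketch of how implicates can be produced in practice, `synthpop::syn()` accepts an `m` argument for the number of implicates (`conf_data` is again a hypothetical confidential data frame):

```{r eval = FALSE}
library(synthpop)

# m = 5 produces five implicates from the same confidential data
synth_object <- syn(conf_data, method = "cart", m = 5, seed = 20220301)

# with m > 1, $syn is a list of synthetic data frames
first_implicate <- synth_object$syn[[1]]
```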
## `r paste("Exercise", exercise_number)`: Sequential Synthesis
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
::: {.panel-tabset}
#### <font color="#55b748">**Question**</font>
You have a confidential dataset that contains information about dogs' `weight` and their `height`. You decide to sequentially synthesize these two variables and write up your method below. Can you spot the mistake in writing up your method?
> To create a synthetic record, first synthetic pet weight is assigned based on a random draw from a normal distribution with mean equal to the average of confidential weights, and standard deviation equal to the standard deviation of confidential weights. Then the confidential `height` is regressed on the synthetic `weight`. Using the resulting regression coefficients, a synthetic `height` variable is generated for each row in the data using just the synthetic `weight` values as an input.
#### <font color="#55b748">**Answer**</font>
You have a confidential dataset that contains information about dogs' `weight` and their `height`. You decide to sequentially synthesize these two variables and write up your method below. Can you spot the mistake in writing up your method?
> To create a synthetic record, first synthetic pet weight is assigned based on a random draw from a normal distribution with mean equal to the average of confidential weights, and standard deviation equal to the standard deviation of confidential weights. Then the confidential `height` is regressed on the synthetic `weight`. Using the resulting regression coefficients, a synthetic `height` variable is generated for each row in the data using just the synthetic `weight` values as an input.
**`Height` should be regressed on the confidential values for `weight`, rather than the synthetic values for `weight`.**
:::
## `r paste("Exercise", exercise_number)`: Multiple Implicates
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
::: {.panel-tabset}
#### <font color="#55b748">**Question**</font>
*What are the privacy implications for releasing multiple versions of a synthetic dataset (implicates)? Do these implications change for partially vs. fully synthetic data?*
#### <font color="#55b748">**Notes**</font>
*What are the privacy implications for releasing multiple versions of a synthetic dataset (implicates)? Do these implications change for partially vs. fully synthetic data?*
- Releasing multiple implicates improves transparency and analytical value, but increases disclosure risk (violates "security through obscurity").
- It is more risky to release partially synthetic implicates, since non-synthesized records are the same across each dataset and there remains a 1-to-1 relationship between confidential and synthesized records.
:::
## `r paste("Exercise", exercise_number)`: Partial vs. fully synthetic
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
Shown here are the first seven rows of a dataset about the prices and attributes of diamonds. Suppose you decide to synthesize the "price" variable, because that information is too sensitive for public release.
| price | carat | cut | color | clarity |
| ----- | ----- | --------- | ----- | ------- |
| 326 | 0.23 | Ideal | E | SI2 |
| 326 | 0.21 | Premium | E | SI1 |
| 327 | 0.23 | Good | E | VS1 |
| 334 | 0.29 | Premium | I | VS2 |
| 335 | 0.31 | Good | J | SI2 |
| 336 | 0.24 | Very Good | J | VVS2 |
| 336 | 0.24 | Very Good | I | VVS1 |
::: {.panel-tabset}
#### <font color="#55b748">**Question**</font>
*After you synthesize the price variable, would the resulting dataset be considered partially or fully synthetic?*
*What are the trade-offs of a partially synthetic dataset compared to a fully synthetic dataset?*
*Describe in words how you would synthesize the "price" variable. Is the method you described parametric or non-parametric? Why?*
#### <font color="#55b748">**Notes**</font>
*After you synthesize the price variable, would the resulting dataset be considered partially or fully synthetic?*
Partially synthetic
*What are the trade-offs of a partially synthetic dataset compared to a fully synthetic dataset?*
- Changing only some variables (partial synthesis) generally leads to higher utility in analysis, since the relationships between non-synthesized variables are by definition unchanged (Drechsler et al., 2008).
- Disclosure from fully synthetic data is difficult for an attacker because all values are imputed; partially synthetic data carry higher disclosure risk because confidential values remain in the dataset (Drechsler et al., 2008).
- Note that while the risk of disclosure for fully synthetic data is very low, it is not zero.
- Accurate and exhaustive specification of variable relationships and constraints in fully synthetic data is difficult and if done incorrectly can lead to bias [@drechslerjorgcomparingsynthetic].
- If a variable is synthesized incorrectly early in a sequential synthesis, all variables synthesized on the basis of that variable will be affected.
- Partially synthetic data may be publicly perceived as more reliable than fully synthetic data.
*Describe in words how you would synthesize the "price" variable. Is the method you described parametric or non-parametric? Why?*
:::
## `r paste("Exercise", exercise_number)`
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
For this exercise, we will use the `starwars` dataset from the `dplyr` package. We will practice sequentially synthesizing a binary variable (`gender`) and a numeric variable (`height`).
```{r}
# run this to get the dataset we will work with
starwars <- dplyr::starwars |>
select(gender, height) |>
drop_na()
starwars |>
head() |>
create_table()
```
<font color="#55b748">**Part 1: Gender synthesis**</font>
::: {.panel-tabset}
### Template
Fill in the blanks in the following code to synthesize the `gender` variable using the underlying distribution present in the data.
```{r eval = FALSE}
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# Fill in the blanks!
# vector of gender categories
gender_categories <- c("feminine", "masculine")
# size of sample to generate
synthetic_data_size <- nrow(starwars)
# probability weights
gender_probs <- starwars |>
count(gender) |>
mutate(relative_frequency = ### ______) |>
pull(relative_frequency)
# use sample function to generate synthetic vector of genders
gender_synthetic <- sample(
x = ###_____,
size = ###_____,
replace = ###_____,
prob = ###_____
)
# create starwars_synthetic dataset using generated variable
starwars_synthetic <- tibble(
gender = gender_synthetic
)
```
<font color="#55b748">**Part 2: Height synthesis**</font>
Similarly, fill in the blanks in the code to generate the `height` variable using a linear regression with `gender` as a predictor.
```{r eval = FALSE}
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# Fill in the blanks!
# linear regression
height_lm <- lm(
formula = ###_____,
data = ###______
)
# predict height with model coefficients
height_predicted <- predict(
object = height_lm,
newdata = ###_____
)
# synthetic column: normal distribution centered on the predictions,
# with sd equal to the residual standard error
height_synthetic <- rnorm(
n = ###_______,
mean = ###______,
sd = ###______
)
# add new values to synthetic data as height
starwars_synthetic <- starwars_synthetic |>
mutate(height = height_synthetic)
```
### Solutions
Fill in the blanks in the following code to synthesize the `gender` variable using the underlying distribution present in the data.
```{r eval = FALSE}
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# Fill in the blanks!
# vector of gender categories
gender_categories <- c("feminine", "masculine")
# size of sample to generate
synthetic_data_size <- nrow(starwars)
# probability weights
gender_probs <- starwars |>
count(gender) |>
mutate(relative_frequency = n / synthetic_data_size) |>
pull(relative_frequency)
# use sample function to generate synthetic vector of genders
gender_synthetic <- sample(
x = gender_categories,
size = synthetic_data_size,
replace = TRUE,
prob = gender_probs
)
# create starwars_synthetic dataset using generated variable
starwars_synthetic <- tibble(
gender = gender_synthetic
)
```
<font color="#55b748">**Part 2: Height synthesis**</font>
Similarly, fill in the blanks in the code to generate the `height` variable using a linear regression with `gender` as a predictor.
```{r eval = FALSE}
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# Fill in the blanks!
# linear regression
height_lm <- lm(
formula = height ~ gender,
data = starwars
)
# predict height with model coefficients
height_predicted <- predict(
object = height_lm,
newdata = starwars_synthetic
)
# synthetic column: normal distribution centered on the predictions,
# with sd equal to the residual standard error
height_synthetic <- rnorm(
n = synthetic_data_size,
mean = height_predicted,
sd = sigma(height_lm)
)
# add new values to synthetic data as height
starwars_synthetic <- starwars_synthetic |>
mutate(height = height_synthetic)
```
:::
## Suggested Reading - Synthetic Data
Snoke, J., Raab, G. M., Nowok, B., Dibben, C., & Slavkovic, A. (2018). General and specific utility measures for synthetic data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(3), 663-688.
[link](https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/rssa.12358)
Bowen, C. M., Bryant, V., Burman, L., Czajka, J., Khitatrakun, S., MacDonald, G., ... & Zwiefel, N. (2022). Synthetic Individual Income Tax Data: Methodology, Utility, and Privacy Implications. In International Conference on Privacy in Statistical Databases (pp. 191-204). Springer, Cham.
[link](https://link.springer.com/chapter/10.1007/978-3-031-13945-1_14)
Raghunathan, T. E. (2021). Synthetic data. Annual Review of Statistics and Its Application, 8, 129-140.
[link](https://www.annualreviews.org/doi/abs/10.1146/annurev-statistics-040720-031848)