---
title: "Utility and Disclosure Risk Metrics"
date: today
format:
  html:
    fig-cap-location: top
    number_sections: false
    embed-resources: true
    toc: true
    css: ../www/web_report.css
editor_options:
  chunk_output_type: console
execute:
  warning: false
  message: false
bibliography: references.bib
---
```{=html}
<style>
@import url('https://fonts.googleapis.com/css?family=Lato&display=swap');
</style>
```
```{r}
#| echo: false
exercise_number <- 1
```
```{r setup}
#| label: setup
#| echo: false
options(scipen = 999)
library(tidyverse)
library(gt)
library(palmerpenguins)
library(urbnthemes)
library(here)
set_urbn_defaults(style = "print")
source(here::here("R", "create_table.R"))
```
## Review
::: {.panel-tabset}
### Question 1
*What's the difference between partially synthetic data and fully synthetic data?*
### Question 1 Notes
*What's the difference between partially synthetic data and fully synthetic data?*
**Partially synthetic data** contains unaltered and synthesized variables. In partially synthetic data, there remains a one-to-one mapping between confidential records and synthetic records.
**Fully synthetic data** only contains synthesized variables. Fully synthetic records no longer directly map onto the confidential records, but remain statistically representative.
:::
::: {.callout-tip}
## Sequential synthesis
In a perfect world, we would synthesize data by directly modeling the joint distribution of the variables of interest. Unfortunately, this is often computationally infeasible.
Instead, we often decompose a joint distribution into a sequence of conditional distributions.
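For example, a joint distribution of three variables can be factored into a product of conditionals and synthesized one variable at a time:
$$f(x_1, x_2, x_3) = f(x_1) \, f(x_2 \mid x_1) \, f(x_3 \mid x_1, x_2)$$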
:::
::: {.panel-tabset}
### Question 2
*What's the difference between specific utility and general utility?*
### Question 2 Notes
*What's the difference between specific utility and general utility?*
**Specific Utility** measures the similarity of results for a specific analysis (or analyses) of the confidential and altered data (e.g., comparing the coefficients in regression models).
**General Utility** measures the univariate and multivariate distributional similarity between the confidential data and the altered data (e.g., sample means, sample variances, and the variance-covariance matrix).
:::
## General Utility Metrics
- As a refresher, general utility metrics measure the distributional similarity (i.e., all statistical properties) between the original and synthetic data.
- General utility metrics are useful because they provide a sense of how "fit for use" synthetic data is for analysis without making assumptions about the uses of the synthetic data.
## Univariate
- **Categorical variables:** frequencies, relative frequencies
- **Numeric variables:** means, standard deviations, skewness, kurtosis (i.e., the first four moments), percentiles, and the number of zero/non-zero values
![](www/images/puf_mean_example.png){width="469"}
![](www/images/compare_sds.png){width="436"}
- It is also useful to visually compare univariate distributions using histograms (@fig-histogram), density plots (@fig-density), and empirical cumulative distribution function plots (@fig-ecdf).
```{r, echo = FALSE, fig.height = 3.5}
compare_penguins <- read_csv(here::here("data", "penguins_synthetic_and_confidential.csv"))
```
```{r}
#| label: fig-histogram
#| fig-cap: Compare Synthetic and Confidential Distributions with Histograms
#| fig-height: 3.5
compare_penguins |>
select(
data_source,
bill_length_mm,
flipper_length_mm
) |>
pivot_longer(-data_source, names_to = "variable") |>
ggplot(aes(x = value, fill = data_source)) +
geom_histogram(alpha = 0.3, color = NA, position = "identity") +
facet_wrap(~ variable, scales = "free") +
scatter_grid()
```
```{r}
#| label: fig-density
#| fig-cap: Compare Synthetic and Confidential Distributions with Density Plots
#| fig-height: 3.5
compare_penguins |>
select(
data_source,
bill_length_mm,
flipper_length_mm
) |>
pivot_longer(-data_source, names_to = "variable") |>
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3, color = NA) +
facet_wrap(~variable, scales = "free") +
scatter_grid()
```
```{r}
#| label: fig-ecdf
#| fig-cap: Compare Synthetic and Confidential Distributions with Empirical CDF Plots
#| fig-height: 3.5
compare_penguins |>
select(
data_source,
bill_length_mm,
flipper_length_mm
) |>
pivot_longer(-data_source, names_to = "variable") |>
ggplot(aes(x = value, color = data_source)) +
stat_ecdf() +
facet_wrap(~ variable, scales = "free") +
scatter_grid()
```
## Bivariate
::: {.callout-tip}
## Correlation Fit
**Correlation fit** measures how well the synthesizer recreates the linear relationships between variables in the confidential dataset.
:::
- Create correlation matrices for the synthetic data and confidential data. Then measure differences across synthetic and actual data. Those differences are often summarized across all variables using [L1](https://en.wikipedia.org/wiki/Taxicab_geometry) or [L2](https://en.wikipedia.org/wiki/Euclidean_distance) distance.
![Correlation Difference](www/images/puf_correlation_fit_example.png){#fig-corrdiff}
* @fig-corrdiff shows the creation of a difference matrix. Let's summarize the difference matrix using mean absolute error (MAE). This gives us a sense of how far off, on average, the correlations in the synthetic data are compared to the confidential data.
$$MAE_{dist} = \frac{1}{n}\sum_{i = 1}^n |dist_i|$$
$$MAE_{dist} = \frac{1}{6} \left(|-0.15| + |0.01| + |0.1| + |-0.15| + |0.15| + |0.02|\right) \approx `r mean(abs(c(-0.15, 0.01, 0.1, -0.15, 0.15, 0.02)))`$$
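As a quick check in R, the same mean absolute error can be computed from the six differences shown above:

```{r}
dist <- c(-0.15, 0.01, 0.1, -0.15, 0.15, 0.02)

mean(abs(dist))
```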
- Advanced measures like *relative mutual information* can be used to measure the relationships between categorical variables.
## Multivariate
::: {.callout-tip}
## Discriminant Based Methods
**Discriminant based methods** measure how well a predictive model can distinguish (i.e., discriminate) between records from the confidential and synthetic data.
:::
- The confidential data and synthetic data should theoretically be drawn from the same super population.
- The basic idea is to combine (stack) the confidential data and synthetic data and see how well a predictive model can distinguish (i.e., discriminate) between synthetic observations and confidential observations.
- An inability to distinguish between the records suggests a good synthesis.
- It is possible to use logistic regression for the predictive modeling, but decision trees, random forests, and boosted trees are more common.
- @fig-discriminant shows three discriminant based metrics calculated on a good synthesis and a poor synthesis.
::: {#fig-discriminant layout-nrow=2}
![Good Synthesis](www/images/same_population_general_utility_metrics.png){width=518}
![Poor Synthesis](www/images/both_axis_different_general_utility_metrics.png){width=518}
A comparison of discriminant metrics on a good synthesis and a poor synthesis
:::
### Calculating Discriminant Metrics
- pMSE ratio, SPECKS, and AUC all require calculating propensity scores (i.e., the probability that a particular record belongs to the synthetic data) and start with the same steps.
1) *Combine the synthetic and confidential data. Add an indicator variable with 0 for the confidential data and 1 for the synthetic data.*
```{r, echo = FALSE}
set.seed(1297)
x = penguins |>
select(species, bill_length_mm, sex) |>
sample_n(2) |>
add_row(.before = 2) |>
mutate(ind = c(0, NA, 1))
x |>
create_table() |>
fmt_missing(columns = everything(),
missing_text = "...") |>
tab_style(cell_fill(color = palette_urbn_main["cyan"], alpha = 0.3),
locations = cells_body(columns = ind))
```
2) *Calculate propensity scores (i.e., probabilities for group membership) for whether a given row belongs to the synthetic dataset.*
```{r, echo = FALSE}
set.seed(1297)
x |>
mutate(prop_score = c(0.32, NA, 0.64)) |>
create_table() |>
fmt_missing(columns = everything(),
missing_text = "...") |>
tab_style(cell_fill(color = palette_urbn_main["cyan"], alpha = 0.3),
locations = cells_body(columns = prop_score))
```
::: {.panel-tabset}
### pMSE
- **pMSE**: calculates the mean squared error (MSE) between the propensity scores and the expected probabilities.
- Proposed by Woo et al. [@woo2009global] and enhanced by Snoke et al. [@snoke_raab_nowok_dibben_slavkovic_2018]
- After doing steps 1) and 2) above:
3) *Calculate the expected probability, i.e., the share of synthetic data in the combined data.* When the synthetic and confidential datasets are of equal size, this will always be 0.5.
```{r, echo = FALSE}
set.seed(1297)
x |>
mutate(prop_score = c(0.32, NA, 0.64),
exp_prob = c(0.5, NA, 0.5)) |>
create_table() |>
fmt_missing(columns = everything(),
missing_text = "...") |>
tab_style(cell_fill(color = palette_urbn_main["cyan"], alpha = 0.3),
locations = cells_body(columns = exp_prob))
```
4) *Calculate the pMSE, which is the mean squared difference between the propensity scores and the expected probabilities.*
$$pMSE = \frac{(0.32 - 0.5)^2 + ... + (0.64-0.5)^2}{N} $$
- Often people use the pMSE ratio, which is the pMSE divided by the pMSE of the null model [@snoke2018general].
- The null model is the expected value of the pMSE score in the best-case scenario, when the model used to generate the data reflects the confidential data perfectly.
- A pMSE ratio of 1 means that your synthetic data and confidential data are indistinguishable, although values this low are almost never achieved.
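The steps above can be tied together in a few lines of R. Below is a minimal sketch using a logistic regression discriminator; `conf_data` and `synth_data` are hypothetical data frames with identical columns.

```{r}
#| eval: false
# step 1: stack the data and add the synthetic indicator
combined <- bind_rows(
  confidential = conf_data,
  synthetic = synth_data,
  .id = "source"
) |>
  mutate(ind = as.numeric(source == "synthetic")) |>
  select(-source)

# step 2: propensity scores, i.e., the probability each record is synthetic
prop_model <- glm(ind ~ ., data = combined, family = binomial())
prop_scores <- predict(prop_model, type = "response")

# step 3: expected probability, i.e., the share of synthetic records
expected_prob <- mean(combined$ind)

# step 4: pMSE, the mean squared difference between the two
pmse <- mean((prop_scores - expected_prob)^2)
pmse
```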
### SPECKS
- **SPECKS**: **S**ynthetic data generation; **P**ropensity score matching; **E**mpirical **C**omparison via the **K**olmogorov-**S**mirnov distance.
After generating propensity scores (i.e., steps 1 and 2 from above), you:
3) *Calculate the empirical CDF's of the propensity scores for the synthetic and confidential data, separately.*
4) *Calculate the Kolmogorov-Smirnov (KS) distance between the 2 empirical CDFs.* The KS distance is the maximum vertical distance between 2 empirical CDF distributions.
![](www/images/ks_distance.png){width="251"}
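Continuing with the hypothetical `combined` and `prop_scores` objects from the pMSE sketch, the KS distance can be computed with base R:

```{r}
#| eval: false
# KS distance between the propensity-score ECDFs of the two groups
ks.test(
  prop_scores[combined$ind == 0],
  prop_scores[combined$ind == 1]
)$statistic
```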
### ROC Curves/AUC
- **Receiver Operating Characteristic (ROC) curves** show the trade-off between false positives and true positives. The area under the curve (AUC) is a single-number summary of the ROC curve.
AUC is a common tool for evaluating classification models. In this context, high values for AUC are bad because they suggest the model can distinguish between confidential and synthetic observations.
After generating propensity scores (i.e., steps 1 and 2 from above), calculate the ROC curve and AUC using the propensity scores and the true synthetic indicator.
![](www/images/roc_curve.png){width="572"}
- In our context, ***High AUC*** = good at discriminating = ***poor synthesis***.
- In the best case, AUC = 0.5, because that means the discriminator is no better than a random guess.
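As a sketch, assuming the `pROC` package and the hypothetical `combined` and `prop_scores` objects from the pMSE tab:

```{r}
#| eval: false
library(pROC)

# AUC of the discriminator; values near 0.5 indicate a good synthesis
auc(roc(combined$ind, prop_scores))
```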
:::
::: {.callout-warning}
Many predictive models for generating propensities can memorize chance features of the data used to fit them. We suggest using a training/testing split and v-fold cross-validation for hyperparameter tuning, and comparing in-sample and out-of-sample propensities and model accuracy.
:::
- Look at @fig-discriminant to see calculations for pMSE ratio, SPECKS, and AUC.
- It is useful to look at [variable importance](https://topepo.github.io/caret/variable-importance.html) for predictive models when observing poor discriminant based metrics. Variable importance can help diagnose which variables are poorly synthesized.
## `r paste("Exercise", exercise_number)`: Using Utility Metrics
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
::: {.panel-tabset}
### Question
Consider the following two syntheses of `x`.
*Which synthesis do you think is better?*
```{r}
#| echo: false
set.seed(20230710)
bind_rows(
synth1 = tibble(
x_conf = rnorm(n = 1000),
x_synth = rnorm(n = 1000, mean = 0.2)
),
synth2 = tibble(
x_conf = rnorm(n = 1000),
x_synth = rnorm(n = 1000, sd = 0.5)
),
.id = "synthesis"
) |>
pivot_longer(-synthesis, names_to = "variable") |>
ggplot(aes(x = value, color = variable)) +
stat_ecdf() +
facet_wrap(~ synthesis) +
scatter_grid()
```
**Both syntheses have issues. What do you think the issues are?**
### Notes
Consider the following two syntheses of `x`. **Which synthesis do you think is better?**
```{r}
#| code-fold: false
set.seed(20230710)
bind_rows(
synth1 = tibble(
x_conf = rnorm(n = 1000),
x_synth = rnorm(n = 1000, mean = 0.2)
),
synth2 = tibble(
x_conf = rnorm(n = 1000),
x_synth = rnorm(n = 1000, sd = 0.5)
),
.id = "synthesis"
) |>
pivot_longer(-synthesis, names_to = "variable") |>
ggplot(aes(x = value, color = variable)) +
stat_ecdf() +
facet_wrap(~ synthesis) +
scatter_grid()
```
**Both syntheses have issues. What do you think the issues are?**
* We consider `synth1` to be slightly better than `synth2` based on the large vertical distances between the lines for `synth2`.
* `synth1` looks to match the variance of the confidential data but the mean is a little too high. `synth2` matches the mean, but it contains far too little variance. There aren't enough observations in the tails of the synthetic data.
:::
## `r paste("Exercise", exercise_number)`: Correlation Difference
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
::: {.panel-tabset}
#### <font color="#55b748">**Question**</font>
Consider the following correlation matrices:
```{r}
#| echo: false
print("Synthetic")
mat_synth <- matrix(
c(
c(1, 0.5, 0.75),
c(0.5, 1, 0.8),
c(0.75, 0.8, 1)
),
byrow = TRUE,
nrow = 3
)
mat_synth
print("Confidential")
mat_conf <- matrix(
c(
c(1, 0.35, 0.1),
c(0.35, 1, 0.9),
c(0.1, 0.9, 1)
),
byrow = TRUE,
nrow = 3
)
mat_conf
```
* Construct the difference matrix
* Calculate MAE
* Optional: Calculate RMSE
* Optional: What is the main difference between MAE and RMSE?
#### <font color="#55b748">**Answer**</font>
```{r}
#| echo: false
print("Synthetic")
mat_synth
print("Confidential")
mat_conf
```
* Construct the difference matrix
```{r}
diff <- mat_synth - mat_conf
diff[!lower.tri(diff)] <- NA
diff
```
* Calculate MAE
```{r}
mean(abs(diff[lower.tri(diff)]))
```
* Optional: Calculate RMSE
```{r}
sqrt(mean(diff[lower.tri(diff)] ^ 2))
```
* Optional: What is the main difference between MAE and RMSE?
RMSE gives extra weight to large errors because it squares values instead of using absolute values. We like to think of this as the difference between the mean and the median error.
:::
## `r paste("Exercise", exercise_number)`: Correlation Difference
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
::: {.panel-tabset}
Part 1: Calculate the correlation fit between the synthetic and confidential data. Fill in the blanks and run the code below.
#### <font color="#55b748">**Question**</font>
```{r, eval = FALSE}
penguins_conf <- read_csv(here::here("data", "penguins_synthetic_and_confidential.csv")) |>
filter(data_source == "confidential")
penguins_synth <- read_csv(here::here("data", "penguins_synthetic_and_confidential.csv")) |>
filter(data_source == "synthetic")
# Fill in the blanks below:
# The cor() function can take in a dataframe and compute correlations
# between all columns in the dataframe and spit out a correlation matrix
conf_data_corr <- cor(###)
synth_data_corr <- cor(###)
conf_data_corr <- conf_data_corr[lower.tri(conf_data_corr)]
synth_data_corr <- synth_data_corr[lower.tri(synth_data_corr)]
correlation_diff <- conf_data_corr - synth_data_corr
# Correlation fit is the sum of the sqrt of the squared differences between each correlation in the difference matrix.
cor_fit <- sum(sqrt( ### ^2))
cor_fit
```
#### <font color="#55b748">**Solution**</font>
```{r}
penguins_conf <- read_csv(here::here("data", "penguins_synthetic_and_confidential.csv")) |>
filter(data_source == "confidential")
penguins_synth <- read_csv(here::here("data", "penguins_synthetic_and_confidential.csv")) |>
filter(data_source == "synthetic")
# The cor() function can take in a dataframe and compute correlations
# between all columns in the dataframe and spit out a correlation matrix
conf_data_corr <- cor(select(penguins_conf, where(is.numeric)))
synth_data_corr <- cor(select(penguins_synth, where(is.numeric)))
conf_data_corr <- conf_data_corr[lower.tri(conf_data_corr)]
synth_data_corr <- synth_data_corr[lower.tri(synth_data_corr)]
correlation_diff <- conf_data_corr - synth_data_corr
# Correlation fit is the sum of the sqrt of the squared differences between each correlation in the difference matrix.
cor_fit <- sum(sqrt(correlation_diff ^2))
cor_fit
```
:::
::: {.panel-tabset}
Part 2: Compare the univariate distributions for `mass` and `height` in the confidential and synthetic data using density plots. Fill in the blanks and run the code below.
#### <font color="#55b748">**Question**</font>
```{r,eval = FALSE}
conf_data <- read_csv(here::here("data/lesson_03_conf_data.csv"))
synth_data <- read_csv(here::here("data/lesson_03_synth_data.csv"))
combined_data <- bind_rows(
"synthetic" = synth_data,
"confidential" = conf_data,
.id = "type"
)
# Create a density plot of the mass distributions
combined_data %>%
  ggplot(aes(x = ###,
             fill = type)) +
  geom_density(alpha = 0.4)

# Create a density plot of the height distributions
combined_data %>%
  ggplot(aes(x = ###,
             fill = type)) +
  geom_density(alpha = 0.4)
```
#### <font color="#55b748">**Solution**</font>
```{r}
conf_data <- read_csv(here::here("data/lesson_03_conf_data.csv"))
synth_data <- read_csv(here::here("data/lesson_03_synth_data.csv"))
combined_data <- bind_rows(
"synthetic" = synth_data,
"confidential" = conf_data,
.id = "type"
)
# Create a density plot of the mass distributions
combined_data %>%
  ggplot(aes(x = mass,
             fill = type)) +
  geom_density(alpha = 0.4)

# Create a density plot of the height distributions
combined_data %>%
  ggplot(aes(x = height,
             fill = type)) +
  geom_density(alpha = 0.4)
```
:::
## Specific Utility Metrics
- Specific utility metrics measure how suitable a synthetic dataset is for specific analyses.
- These specific utility metrics will change from application to application, depending on common uses of the data.
- A helpful rule of thumb: general utility metrics are useful for the data synthesizers to be convinced that they're doing a good job. Specific utility metrics are useful to convince downstream data users that the data synthesizers are doing a good job.
## Recreating Inferences
- It can be useful to compare statistical analyses on the confidential data and synthetic data:
- Do the estimates have the same sign?
- Do the estimates have the same statistical significance at a common $\alpha$ level?
- Do the confidence intervals for the estimates overlap?
- Each of these questions is useful. @barrientos_feasibility_2021 combine all three questions into sign, significance, and overlap (SSO) match. SSO is the proportion of times that intervals overlap and have the same sign and significance.
## Regression Confidence Interval Overlap
::: {.callout-tip}
### Regression Confidence Interval Overlap
**Regression confidence interval overlap** quantifies how well confidence intervals from estimates on the synthetic data recreate confidence intervals from the confidential data.
1 indicates perfect overlap. 0 indicates intervals that are adjacent but not overlapping. Negative values indicate gaps between the intervals.
It is common to compare intervals from linear regression models and logistic regression models.
:::
![](www/images/confidence_interval_overlap_ex.png){width="364"}
- The interpretability of confidence interval overlap diminishes when disclosure control methods generate very wide confidence intervals.
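A minimal sketch of the overlap calculation for a single estimate, where `l_c`/`u_c` and `l_s`/`u_s` are the endpoints of the confidential and synthetic confidence intervals (the values below are hypothetical):

```{r}
# average the share of each interval covered by the overlap
ci_overlap <- function(l_c, u_c, l_s, u_s) {
  overlap <- min(u_c, u_s) - max(l_c, l_s)

  0.5 * (overlap / (u_c - l_c) + overlap / (u_s - l_s))
}

ci_overlap(l_c = 0.8, u_c = 1.6, l_s = 1.0, u_s = 2.0)
```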
## Microsimulation results
- The Urban Institute and Tax Policy Center are heavy users of microsimulation.
- When synthesizing administrative tax data, we compare microsimulation results from tax calculators applied to the confidential data and synthetic data. @fig-microsim shows results from the 2012 Synthetic Supplement PUF.
![microsim](www/images/microsimulation.png){#fig-microsim width=600}
- @fig-microsim compares distributional output from baseline runs. It is also useful to compare tax reforms on the confidential and synthetic data.
## `r paste("Exercise", exercise_number)`: SSO
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
::: {.panel-tabset}
#### <font color="#55b748">**Question**</font>
Suppose we are interested in the following null and alternative hypotheses:
$$H_0: \mu = 0$$
$$H_a: \mu \ne 0$$
Consider the following output:
```{r}
#| echo: false
set.seed(20230709)
x_conf <- rnorm(20, mean = 3)
print(paste("Confidential Mean:", mean(x_conf)))
print("Confidendital Confidence Interval")
t.test(x_conf)$conf.int
x_synth <- rnorm(20, mean = 2)
print(paste("Synthetic Mean:", mean(x_synth)))
print("Synthetic Confidence Interval")
t.test(x_synth)$conf.int
```
**Do the synthetic data achieve SSO match?**
#### <font color="#55b748">**Solution**</font>
Suppose we are interested in the following null and alternative hypotheses:
$$H_0: \mu = 0$$
$$H_a: \mu \ne 0$$
Consider the following output:
```{r}
#| echo: false
set.seed(20230709)
x_conf <- rnorm(20, mean = 3)
print(paste("Confidential Mean:", mean(x_conf)))
print("Confidendital Confidence Interval")
t.test(x_conf)$conf.int
x_synth <- rnorm(20, mean = 2)
print(paste("Synthetic Mean:", mean(x_synth)))
print("Synthetic Confidence Interval")
t.test(x_synth)$conf.int
```
**Do the synthetic data achieve SSO match?**
Yes! The confidence intervals overlap, the signs are the same, and the statistical significance is the same.
:::
## Disclosure Risk Metrics
We now pivot to evaluating the disclosure risks of synthetic data.
## Identity Disclosure Metrics
::: {.callout-tip}
## Identity Disclosure Metrics
Identity disclosure metrics evaluate how often we correctly re-identify confidential records in the synthetic data.
**Note:** These metrics require major assumptions about attacker information.
:::
- For fully synthetic datasets, there is no one-to-one relationship between individuals and records, so identity disclosure risk is ill-defined. Generally, identity disclosure risk applies to partially synthetic datasets (or datasets protected with traditional SDC methods).
- Most of these metrics rely on data maintainers essentially performing attacks against their synthetic data and seeing how successful they are at identifying individuals.
### Basic matching approaches
- We start by making assumptions about the knowledge an attacker has (i.e., the external, publicly accessible data they can use for matching).
- For each confidential record, the data attacker identifies a set of partially synthetic records which they believe contain the target record (i.e., potential matches) using the external variables as matching criteria.
- There are distance-based and probability-based algorithms that can perform this matching. This matching process could be based on exact matches between variables or some relaxations (i.e., matching continuous variables within a certain radius of the target record, or matching adjacent categorical variables).
- We then evaluate how accurate our re-identification process was using a variety of metrics.
As a simple example for the metrics we're about to cover, imagine a data attacker has access to the following external data:
```{r, echo = FALSE}
conf_data <- starwars |>
select(homeworld, species, name)
potential_matches_1 <- conf_data |>
filter(homeworld == "Naboo", species == "Gungan")
potential_matches_2 <- conf_data |>
filter(homeworld == "Naboo", species == "Droid")
external_data <- potential_matches_1 |>
slice(1) |>
bind_rows(potential_matches_2)
external_data |>
create_table() |>
tab_style(
style = list(
cell_fill(color = palette_urbn_magenta[2])
),
locations = cells_body(
columns = "name"
))
```
And imagine that the partially synthetic released data looks like this:
```{r echo = FALSE}
starwars |>
select(homeworld, species, skin_color) |>
head() |>
create_table()
```
Note that the released partially synthetic data does not have names. But using some basic matching rules in combination with the external data, an attacker is able to identify the following potential matches for Jar Jar Binks and R2-D2, two characters in the Star Wars universe:
```{r , echo = FALSE}
potential_jar_jar_matches <- starwars |>
select(homeworld, species, skin_color) |>
filter(homeworld == "Naboo", species == "Gungan") |>
mutate(title = "Potential Jar Jar matches")
potential_r2d2_matches <- starwars |>
select(homeworld, species, skin_color) |>
filter(homeworld == "Naboo", species == "Droid") |>
mutate(title = "Potential R2-D2 Matches")
all_matches <- potential_jar_jar_matches |>
bind_rows(potential_r2d2_matches)
all_matches |>
group_by(title) |>
create_table()
# todo color in cells by true matches
```
And since we are the data maintainers, we can take a look at the confidential data and know that the highlighted rows are "true" matches.
```{r, echo = FALSE}
all_matches |>
group_by(title) |>
create_table() |>
tab_style(
style = list(
cell_fill(color = palette_urbn_magenta[2])
),
locations = cells_body(
rows = skin_color == "orange" | skin_color == "white, blue")
)
```
These matches are counted in various ways to evaluate identity disclosure risk. Below are some of those specific metrics. Generally, for a good synthesis, we want a low expected match rate, a low true match rate, and a high false match rate.
::: {.panel-tabset}
#### Expected Match Rate
- **Expected Match Rate**: On average, how likely is it to find a "correct" match among all potential matches? Essentially, the number of observations in the confidential data that we expect an intruder to match correctly.
- Higher expected match rate = higher identification disclosure risk.
- The two other risk metrics below focus on the subset of confidential records for which the intruder identifies a single match.
- In our example, this is $\frac{1}{3} + 1 \approx 1.33$: one of the three potential Jar Jar matches is correct ($\frac{1}{3}$), and the single potential R2-D2 match is correct ($1$).
#### True Match Rate
- **True Match Rate**: The proportion of true unique matches among all confidential records. Higher true match rate = higher identification disclosure risk.
- Assuming there are 100 rows in the confidential data in our example, this is $\frac{1}{100} = 1\%$.
#### False Match Rate
- **False Match Rate**: The proportion of false matches among the set of unique matches. Lower false match rate = higher identification disclosure risk.
- In our example, this is $\frac{0}{1} = 0\%$.
:::
## Attribute Disclosure Risk Metrics
- We were able to learn about Jar Jar and R2-D2 by re-identifying them in the data. However, it is possible to learn confidential attributes without perfectly re-identifying observations in the data.
### Predictive Accuracy
::: {.callout-tip}
## Predictive Accuracy
Predictive accuracy measures how well an attacker can learn about attributes in the confidential data using the synthetic data (and possibly external data).
:::
- Similar to above, you start by matching synthetic records to confidential records. Alternatively, you can build a predictive model using the synthetic data to make predictions on the confidential data.
- **Key variables**: variables that an attacker already knows about a record and can use to match.
- **Target variables**: variables that an attacker wishes to learn more about or infer using the synthetic data.
- Pick a sensitive variable in the confidential data and use the synthetic data to make predictions. Evaluate the accuracy of the predictions (see the sketch below).
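A minimal sketch of this kind of check: train a model on the synthetic data and evaluate its predictions of a sensitive variable in the confidential data. The object and variable names (`synth_data`, `conf_data`, `sensitive_var`, `key1`, `key2`) are hypothetical.

```{r}
#| eval: false
# train the attacker's model on the synthetic data
attack_model <- lm(sensitive_var ~ key1 + key2, data = synth_data)

# predict the sensitive variable for confidential records
preds <- predict(attack_model, newdata = conf_data)

# root mean squared error of the attacker's predictions;
# lower values indicate higher attribute disclosure risk
sqrt(mean((conf_data$sensitive_var - preds)^2))
```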
### Membership Inference Tests
::: {.callout-tip}
## Membership Inference Tests
**Membership inference tests** explore how well an attacker can determine whether a given observation was in the training data for the synthetic data.
:::
- Why is this important? Sometimes membership in a synthetic dataset is also confidential (e.g., a dataset of HIV positive patients or people who have experienced homelessness).
- Also particularly useful for fully synthetic data where identity disclosure and attribute disclosure metrics don't really make a lot of sense.
- Assumes that the attacker has access to a subset of the confidential data and wants to tell whether one or more records were used to generate the synthetic data.
- Since we as data maintainers know the true answers, we can evaluate whether the attacker's guess is correct and break the results down in many ways (e.g., true positives, true negatives, false positives, and false negatives).
![](www/images/membership_inference_tests.png){width="688"}
source for figure: @mendelevitch2021fidelity
- The "close enough" threshold is usually determined by a custom distance metric, like edit distance between text variables or numeric distance between continuous variables.
- Often you will want to choose different distance thresholds and evaluate how your results change.
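A minimal sketch of a distance-based membership inference test, assuming `attacker_num` and `synth_num` are numeric matrices with the same columns and `known_in_training` is the true membership indicator (all hypothetical objects):

```{r}
#| eval: false
threshold <- 0.1

# distance from each attacker record to its closest synthetic record
closest_dist <- apply(attacker_num, 1, function(record) {
  min(sqrt(rowSums(sweep(synth_num, 2, record)^2)))
})

# the attacker guesses "in the training data" when a synthetic record is close enough
guessed_in_training <- closest_dist < threshold

# as data maintainers, we know the truth and can tabulate the attacker's errors
table(guessed = guessed_in_training, truth = known_in_training)
```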
### Copy Protection
::: {.callout-tip}
## Copy Protection Metrics
**Copy protection metrics** measure how often the synthesizer memorizes or inadvertently duplicates confidential records.
:::
- ***Distance to Closest Record (DCR)***: measures the distance between each real record ($r$) and the closest synthetic record ($s_i$), as determined by a distance calculation.
- Many common distance metrics are used in the literature, including Euclidean distance, cosine distance, Gower distance, and Hamming distance [@mendelevitch2021fidelity].
- The goal of this metric is to expose exact copies or simple perturbations of the real records that exist in the synthetic dataset (a minimal sketch follows below).
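A minimal sketch of DCR with Euclidean distance, assuming `conf_num` and `synth_num` are numeric matrices with the same columns (hypothetical objects):

```{r}
#| eval: false
# distance from each real record to its closest synthetic record
dcr <- apply(conf_num, 1, function(record) {
  min(sqrt(rowSums(sweep(synth_num, 2, record)^2)))
})

# small DCR values flag synthetic records that nearly copy real ones
summary(dcr)
```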
```{r, echo = FALSE, fig.height = 3}
set.seed(123)
df <- tibble(
dist = rnorm(n = 100, mean = 3, sd = 1)
)
df <- df |>
mutate(dist2 = c(rnorm(n = 75, mean = 3, sd = 1), rep(0, 25)))
good_synth <- df |>
ggplot() +
geom_histogram(aes(x = dist), binwidth = 0.4, fill = "steelblue") +
labs(x = "DCR", y = "Count", title = "Mostly large DCR scores")
bad_synth <- df |>
ggplot() +
geom_histogram(aes(x = dist2), binwidth = 0.4, fill = "steelblue") +
labs(x = "DCR", y = "Count", title = "Lots of 0 DCR scores")
good_synth
```
```{r,echo = FALSE, fig.height = 3}
bad_synth
```
- Note that DCR = 0 doesn't necessarily mean a high disclosure risk, because in some datasets the "space" spanned by the variables in scope is relatively small.
### Holdout Data
::: {.callout-note}
## Holdout Data
Membership inference tests and copy protection metrics are informative but lack context. When possible, create a holdout dataset similar to the training data. Then calculate membership inference tests and copy protection metrics, replacing the synthetic data with the holdout data. The results are useful for benchmarking the original membership inference tests and copy protection metrics.
:::
## `r paste("Exercise", exercise_number)`: Disclosure Metrics
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
::: {.panel-tabset}
#### <font color="#55b748">**Question**</font>
::: {#fig-grades layout-ncol=2}