I was wondering whether it would be useful to add new metrics to DPSDGYM to allow for a better (and more general) evaluation.
I have found a paper from the medical research community that tries to solve exactly this evaluation problem (how to compare two datasets). The metrics that they propose are:
KL Divergence for categorical attributes. They use it to measure the information gain of using the real dataset over the synthetic one (in other words, it is a measure of the utility of the synthetic dataset). It could be interesting to extend this to numerical attributes by fitting a KDE and then computing the KL divergence.
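To make the categorical case concrete, here is a minimal sketch of a per-column KL computation, assuming both datasets are pandas DataFrames with matching columns; the function name and the smoothing constant are illustrative rather than the paper's exact formulation:

```python
import pandas as pd
from scipy.stats import entropy

def categorical_kl(real_col: pd.Series, synth_col: pd.Series, eps: float = 1e-12) -> float:
    """KL(real || synth) over the union of categories observed in either column."""
    categories = sorted(set(real_col.unique()) | set(synth_col.unique()))
    p = real_col.value_counts(normalize=True).reindex(categories, fill_value=0.0)
    q = synth_col.value_counts(normalize=True).reindex(categories, fill_value=0.0)
    # Small smoothing term so categories missing from one dataset do not blow up the divergence.
    return float(entropy(p + eps, q + eps))
```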
Pairwise Correlation Difference for numerical attributes. We take the linear correlations of the numerical variables in both datasets and compute the difference between the synthetic and real correlation matrices; we then square those differences and sum them up, giving a metric of how well the synthetic data reproduces the linear correlations of the real dataset. It could be interesting to extend this to categorical attributes, but that requires a different correlation measure (I am using Pearson's), and we would need to see what happens when we sum them up together.
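A minimal sketch of the pairwise correlation difference as described above (square the entrywise differences of the two Pearson correlation matrices and sum them); the function name and the numeric-column selection are assumptions:

```python
import numpy as np
import pandas as pd

def pairwise_correlation_difference(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Sum of squared entrywise differences between the Pearson correlation matrices."""
    num_cols = real.select_dtypes(include="number").columns
    corr_real = real[num_cols].corr(method="pearson").to_numpy()
    corr_synth = synth[num_cols].corr(method="pearson").to_numpy()
    # Taking the square root of this sum would give the Frobenius norm instead.
    return float(np.sum((corr_real - corr_synth) ** 2))
```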
Log Cluster: combine the synthetic and real datasets into one and apply KMeans with 20 clusters. For each cluster, compute the percentage of points that come from the real dataset and plug this value into a small log-based rescaling formula. It measures the difference in how the real and synthetic points are distributed across ALL attributes: if this value is high, the real and synthetic data points are distributed very differently across the input space.
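A rough sketch of the log-cluster metric, assuming numeric (or already encoded) features and the 20 clusters mentioned above; the rescaling here is the common log-of-mean-squared-deviation form and should be double-checked against the paper:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def log_cluster_metric(real: pd.DataFrame, synth: pd.DataFrame, n_clusters: int = 20) -> float:
    merged = np.vstack([real.to_numpy(), synth.to_numpy()])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(merged)
    is_real = np.concatenate([np.ones(len(real)), np.zeros(len(synth))])
    overall_real_share = len(real) / (len(real) + len(synth))
    # For each cluster, compare its share of real points against the overall share of real points.
    squared_devs = [
        (is_real[labels == k].mean() - overall_real_share) ** 2
        for k in range(n_clusters)
    ]
    # Values closer to zero (i.e. higher) mean the real and synthetic points cluster differently;
    # large negative values mean the two datasets are well mixed.
    return float(np.log(np.mean(squared_devs)))
```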
Cross Classification: this has two variations.
Cross Classification Real-Synth: for each variable taken as the target, train a model on the real data and evaluate it on both real and synthetic data (with F1, R², or any other metric depending on the task). Compute the ratio between these scores (real/synth) and average the ratios over all targets. If this metric is 1, the synthetic dataset can be explained just as well by a model trained on the real data. If it is lower than 1, we explain the synthetic data better than the held-out real data; if it is greater than 1, the opposite.
Cross Classification Synth-Real: exactly as above, but with synthetic and real swapped. It measures how well the synthetic dataset can explain the real data points.
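A sketch of one direction (train on real, score on held-out real vs. synthetic) for a single categorical target, using a random forest and macro F1 as placeholders; the full metric would loop over every column as target, average the ratios, and swap real and synthetic for the Synth-Real direction. The helper name and model choice are illustrative, and both frames are assumed to share the same, already encoded, schema:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def cross_classification_ratio(real: pd.DataFrame, synth: pd.DataFrame, target: str) -> float:
    X, y = real.drop(columns=[target]), real[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    score_real = f1_score(y_test, model.predict(X_test), average="macro")
    score_synth = f1_score(synth[target], model.predict(synth.drop(columns=[target])), average="macro")
    # 1 means the model trained on real data explains both datasets equally well;
    # below 1 means the synthetic data is explained better than the held-out real data.
    return score_real / score_synth
```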
They also highlight two other metrics for evaluating the privacy of the real dataset (membership disclosure and attribute disclosure) that could be used on top of the differential privacy guarantees.
Do you think this is worth exploring and (possibly) adding these metrics to the evaluation suite?
Thanks for the thorough issue. I think many of these metrics are worth adding to the library if we can organize them well.
One way to group and organize the metrics is to set up knobs to compare them.
Currently the evaluation suite has 3 metric calculations that can be grouped into 2 categories:
1. Distributional similarities
2. Utility for ML scenarios
It looks like the first two suggested metrics fall under 1. The 3rd and 4th metrics fall under 2.
Within these bounds we have a few engineering requirements to consider:
API: inputs and outputs; ideally we can have some consistency here with sklearn metrics where possible
Runtime performance: how long do these calculations take?
I definitely think these are worth adding to the evaluation suite. But even more so, I think opendp.smartnoise.metrics should be graduated out of tests/dpsdgym with the existing 3 metrics and possibly the 5-6 metrics mentioned above.
This amount of work should be shared though, so please let us know which items you are planning on starting with so we can help parallelize as bandwidth opens up.
Also, are the methods from the paper reproducible from the primitives in opacus or opendp.smartnoise.synthesizers? It would be worth reproducing the results with the given metrics as a way of validating correctness while we look at adding them to the library.
Thanks again for a thorough analysis. I will look into the linked paper soon.