This repository contains the Python codes for our experiments on the privacy level (for the entire dataset), an example of data analysis (using genome statistics), and the run time.
In the evaluation of the privacy level, we compared our two methods (the optimal solution obtained by solving a linear programming problem and the heuristic method) with the existing Kronecker product-based method [Wang et al., 2016] when the privacy level for each attribute is given.
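As a quick reference for this setting, the following is a minimal sketch (not part of our experimental codes) of how the privacy level achieved by a distortion matrix can be computed and how the Kronecker product-based mechanism is assembled, assuming the standard k-ary randomized response for each attribute; the names `krr_matrix` and `privacy_level` are illustrative only.

```python
import numpy as np

def krr_matrix(k, eps):
    """Distortion matrix of the standard k-ary randomized response:
    report the true value w.p. e^eps / (e^eps + k - 1), otherwise a uniform other value."""
    Q = np.full((k, k), 1.0 / (np.exp(eps) + k - 1))
    np.fill_diagonal(Q, np.exp(eps) / (np.exp(eps) + k - 1))
    return Q  # Q[y, x] = Pr[output = y | input = x]

def privacy_level(Q):
    """Privacy level achieved by a distortion matrix Q, i.e., the maximum over
    outputs y and input pairs x, x' of ln(Q[y, x] / Q[y, x'])."""
    return float(np.log(Q.max(axis=1) / Q.min(axis=1)).max())

# Existing Kronecker product-based construction: apply a randomized response to
# each attribute independently and take the Kronecker product of the matrices.
eps_attr = [0.5, 0.5, 1.0]   # per-attribute privacy levels
ks = [2, 3, 4]               # numbers of categories per attribute
Q_kron = np.array([[1.0]])
for k, e in zip(ks, eps_attr):
    Q_kron = np.kron(Q_kron, krr_matrix(k, e))

print(privacy_level(Q_kron))  # equals sum(eps_attr) = 2.0 for the Kronecker product
```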
As an analysis example using our method, we considered the case where the privacy level for the entire dataset is fixed and evaluated the accuracy of the output (differentially private) genome statistics.
The run time results show that our heuristic method can be performed in about
In Supplements.pdf, we provide a review of related studies, concrete descriptions of our methods, and the experimental results and discussion for the analysis example.
(For the relations between the per-attribute privacy levels and the privacy level for the entire dataset, please refer to our paper.)
We may also construct a linear programming problem (with variable reduction) for obtaining the optimal solution.
However, the LP problem with variable reduction is often infeasible (cf. LP_variableReduction.ipynb). Moreover, our heuristic method is (currently) more efficient than solving the linear programming problem (cf. Cohen, Lee, and Song, 2021; Brand, 2020); therefore, to efficiently construct a randomized response that achieves stronger privacy guarantees than the Kronecker product-based method, our heuristic method might be more useful (and is more interesting).
Of course, there may be cases where solving the linear programming problem would be superior, depending on the required optimality, the achieved privacy levels, and the run time.
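As a side note on the infeasibility mentioned above, the toy snippet below only illustrates how an infeasible linear program is detected with SciPy's `linprog`; the constraints are made up for illustration and are not those of the LP in LP_variableReduction.ipynb.

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP: minimize c^T x subject to A_eq x = b_eq, x >= 0.
# The two equality constraints contradict each other, so the LP is infeasible.
c = np.array([1.0, 1.0])
A_eq = np.array([[1.0, 1.0],
                 [1.0, 1.0]])
b_eq = np.array([1.0, 2.0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None), (0, None)])
print(res.status, res.message)  # status 2 means the problem is infeasible
```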
Set the minimum privacy level to be guaranteed for each of the attributes, and then proceed as follows:
- If the given privacy budgets are all distributable, distribute them as is.
- If they are not distributable, prioritize reducing the budgets for information that does not require a high degree of accuracy or that can be expected to be accurate even with a small budget. (Distribute as large a budget as possible to information that requires a high degree of accuracy.)
- Check whether the updated budgets are distributable, using our methods each time.
- Ultimately, create a situation where privacy guarantees at the specified levels or better are achieved for all information.
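The following is a minimal sketch of this adjustment loop; `adjust_budgets`, the `is_distributable` argument, and the toy check in the usage line are hypothetical stand-ins, not functions from our codes.

```python
def adjust_budgets(is_distributable, epsilons, priorities, step=0.25, min_eps=0.1):
    """Reduce the budgets of low-priority information until the per-attribute
    privacy levels can be distributed (as judged by `is_distributable`)."""
    eps = list(epsilons)
    while not is_distributable(eps):
        # Attributes whose budget can still be lowered without going below min_eps.
        candidates = [i for i in range(len(eps)) if eps[i] - step >= min_eps]
        if not candidates:
            raise ValueError("no distributable assignment found under these settings")
        # priorities[i]: larger value = higher accuracy requirement (keep its budget).
        i = min(candidates, key=lambda j: priorities[j])
        eps[i] -= step
    return eps

# Usage with a toy stand-in for the distributability check (illustration only).
toy_check = lambda eps: sum(eps) <= 2.5
print(adjust_budgets(toy_check, epsilons=[1.0, 1.0, 1.0], priorities=[0, 2, 1]))
# -> [0.5, 1.0, 1.0]
```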
Note that our methods are expected to distribute larger privacy budgets than the existing Kronecker product-based method (as shown in the results of our analysis example).
Although it is also possible to regard
・The required privacy level for each piece of information should vary depending on the kind of data and the purpose of the analysis. In the future, it would be desirable to evaluate not only the privacy level, as in our experiments, but also the accuracy in more detail for each analysis method. Ultimately, it would be fascinating to develop an optimal randomized response in terms of the trade-off between privacy and utility. (Are there any cases where the distortion matrix
・Improving and theoretically analyzing the optimality of our heuristic method (especially when
・Developing better heuristic methods that can achieve privacy levels as close to the input ones as possible. (We feel that this task is not as easy as it may seem. Our proposed algorithm can achieve privacy levels closer to the input ones by changing the order of attributes, for example; however, a method that does not require such a parameter-tuning-like procedure would be more desirable.) (Note that for the experiments in the Privacy Level folder, we considered situations in which the achieved
・Developing optimal methods for multi-dimensional "numeric" data. (This study focused on categorical data.) ← One possible direction is to extend the methods of Duchi et al., 2018 and Wang et al., 2019 by combining them with our methods. As for Algorithm 4 in the paper of Wang et al., given that it can also handle categorical data, our proposed methods already have the potential to increase the privacy budget.
・Enhancing our methods under other concepts of differential privacy like One-Sided Differential Privacy [Kotsogiannis et al., 2020].
・Considering the Randomized Response techniques under
For details of our methods and discussion, please see our paper entitled "Privacy-Optimized Randomized Response for Sharing Multi-Attribute Data" (https://doi.org/10.1109/ISCC61673.2024.10733730, arXiv: http://arxiv.org/abs/2402.07584) presented at IEEE ISCC 2024.
Akito Yamamoto
Division of Medical Data Informatics, Human Genome Center,
The Institute of Medical Science, The University of Tokyo