This repository contains the Python implementation of PrivPGD, a generation method for marginal-based private data synthesis introduced in the paper "Privacy-Preserving Data Release Leveraging Optimal Transport and Particle Gradient Descent".
Sharing sensitive datasets plays a key role in data-driven decision-making across many fields, including healthcare and government. However, releasing such datasets often raises significant privacy concerns. Differential Privacy (DP) has emerged as an effective paradigm to address these concerns, ensuring privacy preservation in our increasingly data-centric world.
PrivPGD is a novel approach for differentially private tabular data synthesis. It creates high-quality, private copies of protected tabular datasets from noisy measurements of their marginals. PrivPGD leverages particle gradient descent coupled with an optimal transport-based divergence, facilitating the efficient integration of marginal information during the dataset generation process.
Key advantages of PrivPGD include:
- State-of-the-Art Performance: Demonstrates superior performance in benchmarks and downstream tasks.
- Scalability: Features an optimized gradient computation suitable for parallelization on modern GPUs, making it particularly fast at handling large datasets and many marginals.
- Geometry Preservation: Preserves the geometry of dataset features, such as rankings, aligning more naturally with the nuances of real-world data.
- Domain-Specific Constraints Incorporation: Enables the inclusion of additional constraints in the synthetic data.
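To give a concrete flavor of the core idea described above, here is a minimal, self-contained sketch of particle gradient descent against noisy one-dimensional marginals. It uses the fact that in one dimension optimal transport reduces to matching sorted values. This is only a toy illustration under simplified assumptions, not the repository's implementation: PrivPGD uses a sliced optimal-transport divergence over all privatized marginals, a GPU-optimized gradient computation, and noise calibrated to the privacy budget.

```python
# Toy sketch of the particle-gradient-descent idea (NOT the repository's implementation):
# particles are optimized so that their 1D marginals match noisy target marginals.
# In 1D, optimal transport reduces to comparing sorted values, which this loss exploits.
import torch

torch.manual_seed(0)

n_particles, n_features = 2000, 3
sensitive = torch.rand(5000, n_features)  # stand-in for the protected dataset

# "Privatized" 1D marginals: per-feature quantiles perturbed with Gaussian noise.
# (In PrivPGD the noise scale is calibrated to the differential-privacy budget.)
quantile_levels = torch.linspace(0, 1, n_particles)
noisy_targets = torch.stack(
    [torch.quantile(sensitive[:, j], quantile_levels) for j in range(n_features)],
    dim=1,
) + 0.01 * torch.randn(n_particles, n_features)

# Particles = the synthetic dataset, updated by gradient descent.
particles = torch.rand(n_particles, n_features, requires_grad=True)
optimizer = torch.optim.Adam([particles], lr=0.05)

for step in range(300):
    optimizer.zero_grad()
    loss = 0.0
    for j in range(n_features):
        # 1D optimal transport between empirical distributions:
        # sort both sides and compare matched values.
        sorted_particles, _ = torch.sort(particles[:, j])
        sorted_targets, _ = torch.sort(noisy_targets[:, j])
        loss = loss + ((sorted_particles - sorted_targets) ** 2).mean()
    loss.backward()
    optimizer.step()

synthetic = particles.detach().clamp(0, 1)  # release the particle positions as data
```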
The `src` folder contains the core code of the package, organized into several subfolders, each catering to specific functionalities. The codebase is heavily inspired by the PGM repository.
- One subfolder handles marginal selection and privatization (a toy illustration of this step follows the list below):
  - Key files implement the individual selection and privatization mechanisms.
  - Additional utility files supporting these mechanisms are also located in this folder.
- Another subfolder contains the code for the generation methods:
  - `pgm`: Contains the implementation of the PGM method.
  - `privpgd`: Contains the PrivPGD method, our novel approach for differentially private data generation.
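For intuition about what these mechanisms do, the snippet below is a hedged sketch of one such operation: measuring a 2-way marginal (a contingency table) and privatizing it with the Gaussian mechanism under zero-concentrated DP. The function name, its signature, and the simple accounting shown here are illustrative assumptions, not the package's actual API.

```python
# Illustrative sketch only (not the package's API): count a 2-way marginal and
# privatize it with the Gaussian mechanism under rho-zCDP.
import numpy as np
import pandas as pd


def noisy_marginal(df, cols, sizes, rho):
    """Gaussian-mechanism estimate of the marginal (contingency table) over `cols`.

    A histogram of counts has L2 sensitivity 1 under add/remove-one-record DP,
    so Gaussian noise with sigma = 1 / sqrt(2 * rho) satisfies rho-zCDP.
    """
    index = pd.MultiIndex.from_product([range(sizes[c]) for c in cols], names=list(cols))
    counts = (
        df.groupby(list(cols)).size()
        .reindex(index, fill_value=0)
        .to_numpy(dtype=float)
        .reshape([sizes[c] for c in cols])
    )
    sigma = 1.0 / np.sqrt(2.0 * rho)
    return counts + np.random.normal(scale=sigma, size=counts.shape)


# Example: two discretized columns with 4 and 3 categories, privatized with rho = 0.1.
rng = np.random.default_rng(0)
data = pd.DataFrame({"age_bin": rng.integers(0, 4, 1000), "income_bin": rng.integers(0, 3, 1000)})
noisy = noisy_marginal(data, ("age_bin", "income_bin"), {"age_bin": 4, "income_bin": 3}, rho=0.1)
print(noisy.shape)  # (4, 3)
```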
- Python 3.11.5
- Numpy 1.26.2
- Scipy 1.11.4
- Scikit-learn 1.2.2
- Pandas 2.1.4
- Torch 2.1.2
- CVXPY 1.4.1
- Disjoint Set 0.7.4
- Networkx 3.1
- Autodp 0.2.3.1
- POT 0.9.1
- Folktables 0.0.12
- Openml 0.14.1
- Seaborn 0.13.0
To set up your environment and install the package, follow these steps:
Start by creating a Conda environment with Python 3.11.5. This step ensures your package runs in an environment with the correct Python version.
```bash
conda create -n privpgd python=3.11.5
conda activate privpgd
```
There are two ways to install the package:
- Local Installation:
  Start by cloning the repository from GitHub. Then, upgrade `pip` to its latest version and use the local setup files to install the package. This method is ideal for development or when you have the source code.

  ```bash
  git clone https://github.com/jaabmar/private-pgd.git
  cd private-pgd
  pip install --upgrade pip
  pip install -e .
  ```
- Direct Installation from GitHub (Recommended):
You can also install the package directly from GitHub. This method is straightforward and ensures you have the latest version.
  ```bash
  pip install git+https://github.com/jaabmar/private-pgd.git
  ```
In the `examples` folder, you'll find practical examples showcasing how to use the package. These examples are designed to help you understand how to apply the different mechanisms and methods included in the package.
Before running the scripts, you need to download the ACS datasets:
1. After locally downloading the repository, navigate to the `data` directory:

   ```bash
   cd path/to/private-pgd/data
   ```

2. Run the `create_data.py` script to download the datasets:

   ```bash
   python create_data.py
   ```

The datasets will be downloaded and stored in the `datasets` folder within the `data` directory.
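If you want to see where these datasets come from, the snippet below shows how the ACS Income task for California 2018 can be fetched with the `folktables` package (one of the listed dependencies). This is only an illustration of the underlying data source; the exact preprocessing and discretization performed by `create_data.py` may differ.

```python
# Illustration of the underlying data source (preprocessing in create_data.py may differ):
# download the ACS Income task for California, 2018, via folktables.
from folktables import ACSDataSource, ACSIncome

data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs_data = data_source.get_data(states=["CA"], download=True)

# Split into feature, label, and group DataFrames for the income prediction task.
features, label, group = ACSIncome.df_to_pandas(acs_data)
print(features.shape, label.shape)
```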
- `experiment.py`: A general file for running experiments. It's a versatile script that can be used for various experiment configurations.
- `mst+pgm.py`: Use this script to run experiments with the PGM generation method, using MST for marginal selection.
- `aim+pgm.py`: Use this script to run experiments with the PGM generation method, using AIM for marginal selection.
- `privpgd.py`: This script is dedicated to running experiments with PrivPGD, our novel approach for differentially private data synthesis.
- `privpgd_with_constraint.py`: This script shows how to incorporate domain-specific constraints into PrivPGD.
To run experiments, you will interact with the scripts via the command line, and command handling is facilitated by Click (version 8.1.7). For example, to run an experiment with PrivPGD using the default hyperparameters and the setup described in our paper on the ACS Income California 2018 dataset, follow these steps:
1. Change directory (`cd`) to the `examples` folder.

2. Run the command:

   ```bash
   python privpgd.py
   ```
This command will initiate the experiment with PrivPGD using the specified dataset and default settings.
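Because the example scripts are built with Click, you can list all available options and their default values by appending `--help`, e.g. `python privpgd.py --help`; the same flag works for the other scripts.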
For a detailed, step-by-step understanding of PrivPGD, refer to the Tutorial Jupyter notebook. This notebook includes comprehensive explanations and visualizations, walking you through the entire process of using PrivPGD for differentially private data synthesis.
We welcome contributions to improve this project. Here's how you can contribute:
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
For any inquiries, please reach out:
- Javier Abad Martinez - javier.abadmartinez@ai.ethz.ch
- Konstantin Donhauser - konstantin.donhauser@ai.ethz.ch
- Neha Hulkund - nhulkund@mit.edu
If you find this code useful, please consider citing our paper:
```bibtex
@inproceedings{donhauserprivacy,
  title={Privacy-Preserving Data Release Leveraging Optimal Transport and Particle Gradient Descent},
  author={Donhauser, Konstantin and Abad, Javier and Hulkund, Neha and Yang, Fanny},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
}
```
We are actively developing new features and improvements for our framework:
- Evaluation Pipeline: Incorporating other evaluation metrics, such as differences in the covariance matrix and other higher-order queries, and assessing performance on downstream tasks (a sketch of one such metric follows below).
- Benchmark with Additional Algorithms: Integration of algorithms like Private GSD, RAP, and GEM is underway, to provide a broader benchmark comparison.
These updates are part of our continuous effort to refine our framework and provide a robust benchmark in the field of differentially private data synthesis.
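As a rough illustration of the kind of metric planned for the evaluation pipeline, the snippet below computes a relative Frobenius-norm difference between the covariance matrices of a real and a synthetic dataset. The function is a hypothetical example, not part of the current package.

```python
# Hypothetical example of an evaluation metric under consideration:
# relative Frobenius-norm difference between the covariance matrices of the
# real and the synthetic data. Not part of the current package.
import numpy as np


def covariance_difference(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Relative Frobenius-norm difference between the two covariance matrices."""
    cov_real = np.cov(real, rowvar=False)
    cov_synth = np.cov(synthetic, rowvar=False)
    return np.linalg.norm(cov_real - cov_synth, ord="fro") / np.linalg.norm(cov_real, ord="fro")


# Usage on toy data:
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 5))
synthetic = real + 0.1 * rng.normal(size=(1000, 5))
print(f"relative covariance difference: {covariance_difference(real, synthetic):.3f}")
```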