This repository contains the Python implementation of PrivPGD, a generation method for marginal-based private data synthesis introduced in the paper "Privacy-Preserving Data Release Leveraging Optimal Transport and Particle Gradient Descent".
Sharing sensitive datasets plays a key role in data-driven decision-making across many fields, including healthcare and government. However, releasing such datasets often raises significant privacy concerns. Differential Privacy (DP) has emerged as an effective paradigm to address these concerns, ensuring privacy preservation in our increasingly data-centric world.
PrivPGD is a novel approach for differentially private tabular data synthesis. It creates high-quality, private copies of protected tabular datasets from noisy measurements of their marginals. PrivPGD leverages particle gradient descent coupled with an optimal transport-based divergence, facilitating the efficient integration of marginal information during the dataset generation process.
Key advantages of PrivPGD include:
- State-of-the-Art Performance: Demonstrates superior performance in benchmarks and downstream tasks.
- Scalability: Features an optimized gradient computation suitable for parallelization on modern GPUs, making it particularly fast at handling large datasets and many marginals.
- Geometry Preservation: Preserves the geometry of dataset features, such as rankings, aligning more naturally with the nuances of real-world data.
- Domain-Specific Constraints Incorporation: Enables the inclusion of additional constraints in the synthetic data.
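To give a concrete flavor of the core idea described above, here is a minimal, self-contained sketch of particle gradient descent against noisy one-dimensional marginals. It uses the fact that in one dimension optimal transport reduces to matching sorted values. This is only a toy illustration under simplified assumptions, not the repository's implementation: PrivPGD uses a sliced optimal-transport divergence over all privatized marginals, a GPU-optimized gradient computation, and noise calibrated to the privacy budget.

```python
# Toy sketch of the particle-gradient-descent idea (NOT the repository's implementation):
# particles are optimized so that their 1D marginals match noisy target marginals.
# In 1D, optimal transport reduces to comparing sorted values, which this loss exploits.
import torch

torch.manual_seed(0)

n_particles, n_features = 2000, 3
sensitive = torch.rand(5000, n_features)  # stand-in for the protected dataset

# "Privatized" 1D marginals: per-feature quantiles perturbed with Gaussian noise.
# (In PrivPGD the noise scale is calibrated to the differential-privacy budget.)
quantile_levels = torch.linspace(0, 1, n_particles)
noisy_targets = torch.stack(
    [torch.quantile(sensitive[:, j], quantile_levels) for j in range(n_features)],
    dim=1,
) + 0.01 * torch.randn(n_particles, n_features)

# Particles = the synthetic dataset, updated by gradient descent.
particles = torch.rand(n_particles, n_features, requires_grad=True)
optimizer = torch.optim.Adam([particles], lr=0.05)

for step in range(300):
    optimizer.zero_grad()
    loss = 0.0
    for j in range(n_features):
        # 1D optimal transport between empirical distributions:
        # sort both sides and compare matched values.
        sorted_particles, _ = torch.sort(particles[:, j])
        sorted_targets, _ = torch.sort(noisy_targets[:, j])
        loss = loss + ((sorted_particles - sorted_targets) ** 2).mean()
    loss.backward()
    optimizer.step()

synthetic = particles.detach().clamp(0, 1)  # release the particle positions as data
```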
The `src` folder contains the core code of the package, organized into several subfolders, each catering to specific functionalities. The codebase is heavily inspired by the PGM repository.
- One subfolder handles marginal selection and privatization (a toy illustration of this step follows the list below):
  - Key files implement the individual selection and privatization mechanisms.
  - Additional utility files supporting these mechanisms are also located in this folder.
- Another subfolder contains the code for the generation methods:
  - `pgm`: Contains the implementation of the PGM method.
  - `privpgd`: Contains the PrivPGD method, our novel approach for differentially private data generation.
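For intuition about what these mechanisms do, the snippet below is a hedged sketch of one such operation: measuring a 2-way marginal (a contingency table) and privatizing it with the Gaussian mechanism under zero-concentrated DP. The function name, its signature, and the simple accounting shown here are illustrative assumptions, not the package's actual API.

```python
# Illustrative sketch only (not the package's API): count a 2-way marginal and
# privatize it with the Gaussian mechanism under rho-zCDP.
import numpy as np
import pandas as pd


def noisy_marginal(df, cols, sizes, rho):
    """Gaussian-mechanism estimate of the marginal (contingency table) over `cols`.

    A histogram of counts has L2 sensitivity 1 under add/remove-one-record DP,
    so Gaussian noise with sigma = 1 / sqrt(2 * rho) satisfies rho-zCDP.
    """
    index = pd.MultiIndex.from_product([range(sizes[c]) for c in cols], names=list(cols))
    counts = (
        df.groupby(list(cols)).size()
        .reindex(index, fill_value=0)
        .to_numpy(dtype=float)
        .reshape([sizes[c] for c in cols])
    )
    sigma = 1.0 / np.sqrt(2.0 * rho)
    return counts + np.random.normal(scale=sigma, size=counts.shape)


# Example: two discretized columns with 4 and 3 categories, privatized with rho = 0.1.
rng = np.random.default_rng(0)
data = pd.DataFrame({"age_bin": rng.integers(0, 4, 1000), "income_bin": rng.integers(0, 3, 1000)})
noisy = noisy_marginal(data, ("age_bin", "income_bin"), {"age_bin": 4, "income_bin": 3}, rho=0.1)
print(noisy.shape)  # (4, 3)
```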
- Python 3.11.5
- Numpy 1.26.2
- Scipy 1.11.4
- Scikit-learn 1.2.2
- Pandas 2.1.4
- Torch 2.1.2
- CVXPY 1.4.1
- Disjoint Set 0.7.4
- Networkx 3.1
- Autodp 0.2.3.1
- POT 0.9.1
- Folktables 0.0.12
- Openml 0.14.1
- Seaborn 0.13.0
To set up your environment and install the package, follow these steps:
Start by creating a Conda environment with Python 3.11.5. This step ensures your package runs in an environment with the correct Python version.
```bash
conda create -n privpgd python=3.11.5
conda activate privpgd
```
There are two ways to install the package:
- Local Installation:
  Start by cloning the repository from GitHub. Then, upgrade `pip` to its latest version and use the local setup files to install the package. This method is ideal for development or when you have the source code.

  ```bash
  git clone https://github.com/jaabmar/private-pgd.git
  cd private-pgd
  pip install --upgrade pip
  pip install -e .
  ```
- Direct Installation from GitHub (Recommended):
You can also install the package directly from GitHub. This method is straightforward and ensures you have the latest version.
  ```bash
  pip install git+https://github.com/jaabmar/private-pgd.git
  ```
In the `examples` folder, you'll find practical examples showcasing how to use the package. These examples are designed to help you understand how to apply the different mechanisms and methods included in the package.
Before running the scripts, you need to download the ACS datasets:
1. After locally downloading the repository, navigate to the `data` directory:

   ```bash
   cd path/to/private-pgd/data
   ```

2. Run the `create_data.py` script to download the datasets:

   ```bash
   python create_data.py
   ```

The datasets will be downloaded and stored in the `datasets` folder within the `data` directory.
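If you want to see where these datasets come from, the snippet below shows how the ACS Income task for California 2018 can be fetched with the `folktables` package (one of the listed dependencies). This is only an illustration of the underlying data source; the exact preprocessing and discretization performed by `create_data.py` may differ.

```python
# Illustration of the underlying data source (preprocessing in create_data.py may differ):
# download the ACS Income task for California, 2018, via folktables.
from folktables import ACSDataSource, ACSIncome

data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs_data = data_source.get_data(states=["CA"], download=True)

# Split into feature, label, and group DataFrames for the income prediction task.
features, label, group = ACSIncome.df_to_pandas(acs_data)
print(features.shape, label.shape)
```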
- `experiment.py`: A general file for running experiments. It's a versatile script that can be used for various experiment configurations.
- `mst+pgm.py`: Use this script to run experiments with the PGM generation method, using MST for marginal selection.
- `aim+pgm.py`: Use this script to run experiments with the PGM generation method, using AIM for marginal selection.
- `privpgd.py`: This script is dedicated to running experiments with PrivPGD, our novel approach for differentially private data synthesis.
- `privpgd_with_constraint.py`: This script shows how to incorporate domain-specific constraints into PrivPGD.
To run experiments, you will interact with the scripts via the command line, and command handling is facilitated by Click (version 8.1.7). For example, to run an experiment with PrivPGD using the default hyperparameters and the setup described in our paper on the ACS Income California 2018 dataset, follow these steps:
1. Change directory (`cd`) to the `examples` folder.

2. Run the command:

   ```bash
   python privpgd.py
   ```
This command will initiate the experiment with PrivPGD using the specified dataset and default settings.
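Because the example scripts are built with Click, you can list all available options and their default values by appending `--help`, e.g. `python privpgd.py --help`; the same flag works for the other scripts.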
For a detailed, step-by-step understanding of PrivPGD, refer to the Tutorial Jupyter notebook. This notebook includes comprehensive explanations and visualizations, walking you through the entire process of using PrivPGD for differentially private data synthesis.
We welcome contributions to improve this project. Here's how you can contribute:
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
For any inquiries, please reach out:
- Javier Abad Martinez - javier.abadmartinez@ai.ethz.ch
- Konstantin Donhauser - konstantin.donhauser@ai.ethz.ch
- Neha Hulkund - nhulkund@mit.edu
If you find this code useful, please consider citing our paper:
```bibtex
@inproceedings{donhauserprivacy,
  title={Privacy-Preserving Data Release Leveraging Optimal Transport and Particle Gradient Descent},
  author={Donhauser, Konstantin and Abad, Javier and Hulkund, Neha and Yang, Fanny},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
}
```
We are actively developing new features and improvements for our framework:
- Evaluation Pipeline: Incorporating other evaluation metrics, such as differences in the covariance matrix and other higher-order queries, and assessing performance on downstream tasks (a sketch of one such metric follows below).
- Benchmark with Additional Algorithms: Integration of algorithms like Private GSD, RAP, and GEM is underway, to provide a broader benchmark comparison.
These updates are part of our continuous effort to refine our framework and provide a robust benchmark in the field of differentially private data synthesis.
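As a rough illustration of the kind of metric planned for the evaluation pipeline, the snippet below computes a relative Frobenius-norm difference between the covariance matrices of a real and a synthetic dataset. The function is a hypothetical example, not part of the current package.

```python
# Hypothetical example of an evaluation metric under consideration:
# relative Frobenius-norm difference between the covariance matrices of the
# real and the synthetic data. Not part of the current package.
import numpy as np


def covariance_difference(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Relative Frobenius-norm difference between the two covariance matrices."""
    cov_real = np.cov(real, rowvar=False)
    cov_synth = np.cov(synthetic, rowvar=False)
    return np.linalg.norm(cov_real - cov_synth, ord="fro") / np.linalg.norm(cov_real, ord="fro")


# Usage on toy data:
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 5))
synthetic = real + 0.1 * rng.normal(size=(1000, 5))
print(f"relative covariance difference: {covariance_difference(real, synthetic):.3f}")
```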