mplspunk/awesome-privacy-engineering
Awesome Privacy Engineering

A curated list of resources related to privacy engineering

Contents


Courses


  • OpenMined Courses
    • Our Privacy Opportunity (Beginner) (7.7 hours)
    • Introduction to Remote Data Science (Intermediate) (8 hours)
    • Foundations of Private Computation (Intermediate) (60 hours)
    • Federated Learning on Mobile (Intermediate) (40 hours) (Not Yet Released)
  • Data Privacy and Anonymization in Python - Datacamp course on learning to process sensitive information with privacy-preserving techniques.
  • Secure and Private AI (Udacity) - Udacity course that covers how to extend PyTorch with the tools necessary to train AI models that preserve user privacy.
  • Practical Data Ethics - This class was originally taught in-person at the University of San Francisco Data Institute in January-February 2020.
  • Privacy-Conscious Computer Systems - This class at Brown University (CSCI 2390) focuses on how to design computer systems that protect users' privacy.
  • Privacy by Design: Data Classification - LinkedIn Learning course by Nishant Bhajaria.
  • Privacy by Design: Data Sharing - LinkedIn Learning course by Nishant Bhajaria.
  • Implementing a Privacy, Risk, and Assurance Program - LinkedIn Learning course by Nishant Bhajaria.
  • Data Protocol - Courses to teach developers and technical professionals how to build products responsibly and partner with platforms effectively.
  • Carnegie Mellon University - Privacy Engineering Certificate - Four-week certificate program that revolves around a combination of mini-tutorials, class discussions, and hands-on exercises designed to ensure that students develop practical knowledge of all key privacy engineering areas.
  • Technical Privacy Masterclass - In four modules, this course from Privado is designed to provide privacy leaders and their teams with an overview of the pillars of a proactive privacy program.
  • Compliance Detective - A gamified approach to learning about privacy engineering, Compliance Detective (formerly Privacy Quest) uses challenges and competitions to build your privacy and security knowledge.
  • Hitchhiker's Guide to Privacy Engineering - The goal of this creative privacy project is to offer a fun, engaging, and immersive privacy learning experience for privacy lawyers to improve their technical privacy skills.

Books


Data Deletion, Data Mapping, and Data Subject Access Requests


  • Deleting Data Distributed Throughout Your Microservices Architecture - Microservices architectures tend to distribute responsibility for data throughout an organization. This poses challenges to ensuring that data is deleted.
  • Handling Data Erasure Requests in Your Data Lake with Amazon S3 Find and Forget - Amazon S3 Find and Forget enables you to find and delete records automatically in data lakes on Amazon S3.
  • How to Delete User Data in an AWS Data Lake - This post walks through a framework that helps you purge individual user data within your organization’s AWS hosted data lake, and an analytics solution that uses different AWS storage layers, along with sample code targeting Amazon S3.
  • Best Practices: GDPR and CCPA Compliance Using Delta Lake - Article that describes how to use Delta Lake on Databricks to manage General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA) compliance for a data lake.
  • Klaro! - Klaro is a simple consent management platform (CMP) and privacy tool that helps you to be transparent about the third-party applications on your website.
  • OpenDSR - A common framework enabling companies to work together to protect consumers' privacy and data rights (formerly known as OpenGDPR).
  • PrivacyBot - PrivacyBot is a simple automated service to initiate CCPA deletion requests with data brokers. (deprecated)
  • Cookie Consent - An opensource, lightweight JavaScript plugin for alerting users about the use of cookies on a website. It is designed to help quickly comply with the European Union Cookie Law, CCPA, GDPR and other privacy laws.
  • Fides - An open-source tool that allows you to easily declare your systems' privacy characteristics, track privacy related changes to systems and data in version control, and enforce policies in both your source code and your runtime infrastructure.
    • Fideslang - Open-source description language for privacy to declare data types and data behaviors in your tech stack in order to simplify data privacy globally. Supports GDPR, CCPA, LGPD and ISO 19944.
    • Fidesops - DSAR Orchestration: Privacy Request automation to fulfill GDPR, CCPA, and LGPD data subject requests. (deprecated)
  • Privado - Privado is an open source static code analysis tool to discover data flows in the code. It detects the personal data being processed, and further maps the journey of the data from the point of collection to interesting sinks such as third parties, databases, logs, and internal APIs.
  • Detecting PII Using Amazon Comprehend - Using Amazon Comprehend to detect entities that contain personally identifiable information (PII) in a text document.
  • Octopii - Octopii is an open-source AI-powered PII scanner that can look for image assets such as Government IDs, passports, photos and signatures in a directory.
  • Data Profiler - DataProfiler is a Python library created by Capital One to make data analysis, monitoring, and sensitive data detection easy.
  • PII Catcher - Scan databases and data warehouses for PII data. Tag tables and columns in data catalogs like Amundsen and Datahub.
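A recurring pattern for making deletion tractable when data is replicated across microservices and data lakes, as in the architectures the articles above discuss, is crypto-shredding: encrypt each user's data under a per-user key and fulfill an erasure request by destroying only the key. A minimal sketch of the idea (the XOR keystream below is a stdlib stand-in for a real cipher, for illustration only and not secure; class and method names are hypothetical):

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher keyed via SHA-256 (illustration only, NOT secure)."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

class CryptoShredStore:
    """Each user's records are encrypted under a per-user key. Deleting the
    key renders every copy of the ciphertext unreadable, even if replicas
    linger in downstream services or backups."""
    def __init__(self):
        self.keys = {}     # user_id -> key (the only thing we must delete)
        self.records = {}  # user_id -> ciphertext (may be widely replicated)

    def put(self, user_id: str, plaintext: bytes):
        key = self.keys.setdefault(user_id, secrets.token_bytes(32))
        self.records[user_id] = keystream_xor(key, plaintext)

    def get(self, user_id: str) -> bytes:
        return keystream_xor(self.keys[user_id], self.records[user_id])

    def forget(self, user_id: str):
        del self.keys[user_id]  # ciphertext may remain; it is now unrecoverable

store = CryptoShredStore()
store.put("alice", b"alice@example.com")
assert store.get("alice") == b"alice@example.com"
store.forget("alice")
# store.get("alice") now raises KeyError: the data is effectively erased
```

The appeal of this design is that deletion becomes a single, auditable operation against the key store instead of a hunt through every replica.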

Privacy Tech Series by Lea Kissner


Privacy Threat Modeling


  • LINDDUN - The LINDDUN privacy engineering framework provides systematic support for the elicitation and mitigation of privacy threats in software systems.
  • LINDDUN GO - LINDDUN GO is designed to give you a quick start to privacy threat modeling.
  • PLOT4AI - Privacy Library Of Threats 4 Artificial Intelligence (PLOT4AI) is a threat modeling library to help practitioners build responsible artificial intelligence.
  • Draw.io Libraries for Threat Modeling - Collection of custom libraries for using the Draw.io diagramming application for threat modeling.
  • xCompass - A privacy threat modeling persona framework that developers can use to test and document privacy threats, and find edge cases of privacy harm (formerly named Models of Applied Privacy (MAP)).
  • Privacy Adversarial Framework (PAF) - Developed by Facebook, the Privacy Adversarial Framework (PAF) is a knowledgebase of privacy-focused adversarial tactics and techniques that is heavily inspired by MITRE ATT&CK®.
  • PANOPTIC™ Privacy Threat Model - MITRE PANOPTIC™, the Pattern and Action Nomenclature Of Privacy Threats In Context, is a privacy threat taxonomy for breaking down and describing privacy attacks against individuals and groups of individuals.

Machine Learning and Algorithmic Bias


  • Aequitas - An open source bias audit toolkit developed by the Center for Data Science and Public Policy at University of Chicago, can be used to audit the predictions of machine learning based risk assessment tools to understand different types of biases, and make informed decisions about developing and deploying such systems.
  • Ethical Machine Learning - Spotting and Preventing Proxy Bias - Jupyter Notebook from rOpenSciLabs that explores several ways of detecting unintentional bias and removing it from a predictive model. (deprecated)
  • Fairness in Machine Learning Engineering - Google's Machine Learning Crash Course includes a 70-minute section on fairness.
  • How to Incorporate Ethics and Risk into Your Machine Learning Development Process - To help highlight ethics and risk in machine learning, this article looks at the six steps involved in developing an ML system, what happens in each step, and the risk and ethics questions that arise.
  • DrivenData: Deon - A command line tool to easily add an ethics checklist to your data science projects.
  • People + AI Guidebook - A friendly, practical guide that lays out some best practices for creating useful, responsible AI applications.
    • Why Some Models Leak Data - Machine learning models use large amounts of data, some of which can be sensitive. If they're not trained correctly, sometimes that data is inadvertently revealed.
    • Datasets Have Worldviews - Every dataset communicates a different perspective. When you shift your perspective, your conclusions can shift, too.
    • Measuring Fairness - How do you make sure a model works equally well for different groups of people?
    • How Randomized Response Can Help Collect Sensitive Information Responsibly - Giant datasets are revealing new patterns in cancer, income inequality and other important areas. However, the widespread availability of fast computers that can cross reference public data is making it harder to collect private information without inadvertently violating people's privacy. Modern randomization techniques can help preserve anonymity.
    • Can a Model Be Differentially Private and Fair? - Training with differential privacy limits the information about any one data point that is extractable but in some cases there’s an unexpected side-effect: reduced accuracy with underrepresented subgroups disparately impacted.
    • Hidden Bias - Models trained on real-world data can encode real-world bias. Hiding information about protected classes doesn't always fix things — sometimes it can even hurt.
    • How Federated Learning Protects Privacy - With federated learning, it’s possible to collaboratively train a model with data from multiple users without any raw data leaving their devices.
  • Fairlearn - A Python package to assess and improve fairness of machine learning models.
  • InterpretML - A toolkit to help understand models and enable responsible machine learning.
  • ML Privacy Meter - A tool to quantify the privacy risks of machine learning models with respect to inference attacks, notably membership inference attacks.
  • Privacy Considerations in Large Language Models - The potential for models to leak details from the data on which they’re trained may be a concern for all large language models, and additional issues may arise if a model trained on private data were to be made publicly available.
  • Explaining Decisions Made with AI - Guidance by the UK's Information Commissioner's Office (ICO) and The Alan Turing Institute aims to give organisations practical advice to help explain the processes, services and decisions delivered or assisted by AI, to the individuals affected by them.
  • Responsible AI Toolbox - Responsible AI Toolbox is a suite of tools from Microsoft that provides a collection of model and data exploration and assessment user interfaces that enable a better understanding of AI systems. The Toolbox consists of four dashboards: an Error Analysis dashboard, an Interpretability dashboard, a Fairness dashboard, and a Responsible AI dashboard.
  • Of Oaths and Checklists - A checklist for people who are working on data projects, authored by DJ Patil, Hilary Mason, and Mike Loukides.
  • Intro to AI Ethics - A Kaggle Learn course to explore practical tools to guide the moral design of AI systems.
  • Failure Modes in Machine Learning - Documentation compiled by Microsoft regarding the different ways that machine learning can fail, both intentionally (through adversarial attack) and unintentionally (formally correct but completely unsafe outcome).
  • Apple Privacy-Preserving Machine Learning Workshop 2022 - In June 2022, Apple hosted the Workshop on Privacy-Preserving Machine Learning (PPML), which brought Apple and members of the academic research communities together to discuss the state of the art in the field of privacy-preserving machine learning through a series of talks and discussions. This post includes highlights from workshop discussions and recordings of select workshop talks.
  • Machine Unlearning - A compilation of existing literature about machine unlearning, a process through which a machine learning model can be made to forget one of its training data points.
  • Fairness and Machine Learning: Limitations and Opportunities - An online textbook by Solon Barocas, Moritz Hardt, and Arvind Narayanan.
  • AI Fairness 360 (AIF360) - A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
  • Model Card Toolkit - Google's Model Card Toolkit streamlines and automates generation of Model Cards, machine learning documents that provide context and transparency into a model's development and performance.
  • Responsible AI by Design - Microsoft's hub for policies, practices, and tools that make up its framework for Responsible AI by Design. Includes a Responsible AI Standard, Responsible AI Impact Assessment Guide, and Responsible AI Impact Assessment Template.
  • Private AI Bootcamp - Youtube playlist of lectures from the Private AI Bootcamp at Microsoft Research Redmond in December 2019.
  • Adversarial Robustness Toolbox (ART) - Python library from the Linux Foundation AI & Data Foundation (LF AI & Data) that enables developers and researchers to defend and evaluate machine learning models and applications against the adversarial threats of evasion, poisoning, extraction, and inference.
  • Trustworthy ML Initiative - The Trustworthy ML Initiative is a community of researchers and practitioners working on topics related to machine learning models and algorithms that are accurate, explainable, fair, privacy-preserving, causal, and robust.
  • AI Nutrition Facts Labels - Tool from Twilio that allows generation of AI Nutrition Labels intended to give consumers and businesses a more transparent and clear view into ‘what's in the box’.
  • Explainable Artificial Intelligence - This course syllabus from Harvard University aims to familiarize students with the recent advances in the emerging field of eXplainable Artificial Intelligence (XAI).
  • SecretFlow - SecretFlow is a unified framework for privacy-preserving data analysis and machine learning.
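The randomized response technique described in the People + AI Guidebook entry above fits in a few lines. In the common coin-flip variant sketched here, each respondent answers truthfully half the time and randomly otherwise, which gives every individual plausible deniability while still letting an analyst recover the population rate:

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    # Flip a coin: heads -> answer truthfully; tails -> flip a second coin
    # and report that, regardless of the truth.
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_rate(responses) -> float:
    # Under this scheme, P(yes) = 0.25 + 0.5 * true_rate,
    # so true_rate is estimated as 2 * observed_rate - 0.5.
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5

rng = random.Random(0)
true_rate = 0.3
answers = [randomized_response(rng.random() < true_rate, rng)
           for _ in range(100_000)]
print(estimate_rate(answers))  # close to 0.3
```

No single answer reveals anything definitive about a respondent, yet the aggregate estimate converges on the true rate as the sample grows.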

Facial Recognition


De-Identification and Anonymization


  • A Visual Guide to Practical Data De-Identification (FPF Infographic)
  • NIST Privacy Engineering Program - De-Identification Tools
  • Presidio - Context aware, pluggable and customizable PII anonymization service for text and images, developed by Microsoft.
  • Redacting Sensitive Information with User-Defined Functions in Amazon Athena - Amazon Athena supports user-defined functions, a feature that enables you to write custom scalar functions and invoke them in SQL queries.
  • AWS AI-Powered Health Data Masking - The AI-Powered Health Data Masking solution in the AWS Solutions Library helps healthcare organizations identify and mask health data in images or text. (deprecated)
  • Anonymize Your Data Using Amazon S3 Object Lambda - Leverage AWS S3 Object Lambdas in order to anonymize data.
  • Static Data Masking for Azure SQL Database and SQL Server - Microsoft's Static Data Masking is a data protection feature that helps users sanitize sensitive data in a copy of their SQL databases. It is compatible with SQL Server (SQL Server 2012 and newer), Azure SQL Database (DTU and vCore-based hosting options, excluding Hyperscale), and SQL Server on Azure Virtual Machines.
  • Google Cloud Data Loss Prevention - Google Cloud's fully managed service designed to help you discover, classify, and protect sensitive data.
  • ARX Data Anonymization Tool - ARX is a comprehensive open source software for anonymizing sensitive personal data.
  • UTD Anonymization ToolBox - UT Dallas Data Security and Privacy Lab compiled various anonymization methods into a toolbox for public use by researchers.
  • Kodex - An open-source toolkit for privacy and security engineering. It helps you to automate data security and data protection measures in your data engineering workflows.
  • Data Anonymizer Extension for PostgreSQL - A set of SQL functions that remove personally identifiable values from a PostgreSQL table and replace them with random-but-plausible values.
  • Anonimatron - Free, extendable, open source data anonymization tool.
  • Anonymizer MySQL - A simple tool that allows you to make an anonymized clone of your database.
  • MySQL Data Anonymizer - MySQL Data Anonymizer is a PHP library that anonymizes your data in the database.
  • Anonymizer - Anonymizer is a universal tool to create anonymized DBs for projects.
  • anonymize-it - The Elastic Machine Learning Team's general purpose tool for suppression, masking, and generalization of fields to aid data pseudonymization.
  • Singapore Guide to Anonymization - The Singapore Personal Data Protection Commission (PDPC) has published the Guide on Basic Anonymization to provide more practical guidance for businesses on how to appropriately perform basic anonymization and de-identification of various datasets.
  • Transforming Data in Google Cloud Platform - This reference covers the available de-identification techniques, or transformations, that can be applied in Google Cloud's Data Loss Prevention (i.e., redaction, replacement, masking, crypto-based tokenization, bucketing, date shifting, and time extraction).
  • Measuring Re-Identification Risk in Datasets / Privacy Definitions - A series of helpful blog posts by Damien Desfontaines on the definitions of k-anonymity, k-map, l-diversity, and delta-presence. Additionally, a video by Utrecht University on the definition of t-closeness.
  • Technical Privacy Metrics: a Systematic Survey - Paper by Isabel Wagner and David Eckhoff that discusses over 80 privacy metrics and introduces categorizations based on the aspect of privacy they measure, their required inputs, and the type of data that needs protection. They also present a method on how to choose privacy metrics based on nine questions that help identify the right privacy metrics for a given scenario.
  • Data Anonymization Tool - The Singapore PDPC has launched a free Data Anonymization tool to help organizations transform simple datasets by applying basic anonymization techniques.
  • Masked AI - Python SDK and CLI wrappers that enable safer usage of public large language models (LLMs) like OpenAI/GPT4 by removing sensitive data from prompts and replacing it with fake data before submitting to the OpenAI API.
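The k-anonymity definition covered in several of the resources above is straightforward to check mechanically: group records by their quasi-identifier values and take the smallest group size. A minimal sketch, using hypothetical pre-generalized records:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns.
    A dataset is k-anonymous if every record shares its quasi-identifier
    values with at least k-1 other records."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values())

# Records already generalized (age bucketed, ZIP truncated).
records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "cold"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "021**", "diagnosis": "asthma"},
]
print(k_anonymity(records, ["age", "zip"]))  # 2: each class holds two records
```

Note that k-anonymity alone says nothing about the sensitive column's diversity within a class, which is why the resources above also discuss l-diversity and t-closeness.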

Homomorphic Encryption


  • Building Safe A.I.: A Tutorial for Encrypted Deep Learning - Blogpost on how to train a neural network that is fully encrypted during training.
  • HElib - HElib is an open-source software library that implements homomorphic encryption.
  • Microsoft SEAL - Microsoft SEAL is an easy-to-use open-source (MIT licensed) homomorphic encryption library developed by the Cryptography and Privacy Research group at Microsoft.
  • nGraph-HE: A Graph Compiler for Deep Learning on Homomorphically Encrypted Data - Intel Research proposes an extension to its deep learning compiler to operate on homomorphically encrypted data.
  • Google Fully-Homomorphic-Encryption - This repository created by Google contains open-source libraries and tools to perform fully homomorphic encryption operations on an encrypted data set.
  • Palisade Homomorphic Encryption Software Library - An open-source project that provides efficient implementations of lattice cryptography building blocks and homomorphic encryption schemes.
  • TFHE - The original version of TFHE (Fast Fully Homomorphic Encryption Library over the Torus) that implements the base arithmetic and functionalities (bootstrapped and leveled), allowing you to perform computations over encrypted data.
  • Concrete - The concrete ecosystem is a set of crates (packages in the Rust language) that implements Zama's variant of TFHE, while most of the complexity of fully homomorphic encryption is hidden under high-level APIs.
  • FHE.org - Community of researchers and developers interested in advancing Fully Homomorphic Encryption (FHE) and other secure computation techniques.
  • blyss - Open-source SDK for accessing data privately using homomorphic encryption.
  • swift-homomorphic-encryption - Apple's open source Swift package that utilizes Private Information Retrieval (PIR).
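The defining property of the libraries above, computing on data while it stays encrypted, can be illustrated with textbook Paillier, a classic additively homomorphic scheme: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. A sketch with deliberately tiny, insecure primes (real deployments should use the libraries listed above):

```python
from math import gcd

# Textbook Paillier with toy parameters -- for illustration only.
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)
g = n + 1
mu = pow(lam, -1, n)                          # valid since gcd(lam, n) == 1

def encrypt(m: int, r: int) -> int:
    assert gcd(r, n) == 1                     # r is the randomizer
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return ((x - 1) // n) * mu % n            # L(x) = (x - 1) / n

a, b = encrypt(5, 23), encrypt(7, 29)
# Multiplying ciphertexts adds the plaintexts: Dec(E(5) * E(7)) == 12
print(decrypt(a * b % n2))  # 12
```

Raising a ciphertext to a scalar power likewise multiplies the plaintext by that scalar, which is enough for private sums and weighted aggregates; fully homomorphic schemes like those in HElib, SEAL, and TFHE extend this to arbitrary computation.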

Tokenization


  • AWS Serverless Tokenization - Learn how to use Lambda Layers to develop a serverless tokenization solution in AWS.
  • auto-data-tokenize - This repo demonstrates a reference implementation of detecting and tokenizing sensitive structured data within Google Cloud Platform.
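The vault-style tokenization the solutions above implement can be sketched simply: a keyed, deterministic token replaces the sensitive value, and a protected lookup table allows authorized reversal. Determinism matters so that joins and deduplication still work on tokenized data. Class and method names below are illustrative:

```python
import hmac
import hashlib

class Tokenizer:
    """Deterministic, keyed tokenization sketch: the token reveals nothing
    about the value without the key, and the vault permits reversal."""
    def __init__(self, key: bytes):
        self.key = key
        self.vault = {}  # token -> original value (must be access-controlled)

    def tokenize(self, value: str) -> str:
        digest = hmac.new(self.key, value.encode(), hashlib.sha256).hexdigest()
        token = "tok_" + digest[:16]
        self.vault[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self.vault[token]

t = Tokenizer(b"demo-key")
tok = t.tokenize("4111-1111-1111-1111")
assert tok == t.tokenize("4111-1111-1111-1111")  # deterministic: joins still work
assert t.detokenize(tok) == "4111-1111-1111-1111"
```

In production the key and vault live in a hardened service (as in the serverless designs above), so downstream systems only ever handle tokens.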

Secure Multi-Party Computation


  • Private Join and Compute - Google's implementation of the "Private Join and Compute" functionality. This functionality allows two users, each holding an input file, to privately compute the sum of associated values for records that have common identifiers.
  • Facebook Private Computation Solutions - Facebook Private Computation Solutions (FBPCS) is a secure, privacy-safe, and scalable architecture for deploying multi-party computation applications in a distributed way on virtual private clouds via the Private Scaling architecture. FBPCS consists of various services and interfaces that enable private measurement solutions, e.g., Private Lift.
  • Facebook Private Computation Framework - Facebook Private Computation Framework (FBPCF) library allows developers to perform randomized controlled trials, without leaking information about who participated or what action an individual took. It uses secure multiparty computation to guarantee this privacy. FBPCF is for scaling multi-party computation up via threading.
  • EzPC (Easy Secure Multi-party Computation) - EzPC is a Microsoft Research tool that converts Tensorflow and ONNX models into Secure Multi-Party Computation protocols.
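A core building block behind these frameworks is additive secret sharing: each party splits its private input into random shares that sum to the value, so no individual share reveals anything, yet the parties can jointly compute a sum. A minimal sketch of a private sum among three hypothetical parties:

```python
import random

PRIME = 2_147_483_647  # field modulus for the shares

def share(value: int, n_parties: int, rng: random.Random):
    """Split value into n additive shares that sum to value mod PRIME.
    Any subset of fewer than n shares is uniformly random."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

rng = random.Random(42)
salaries = [95_000, 120_000, 88_000]  # each party's private input
# Each party shares its input; party i only ever sees share column i.
all_shares = [share(s, 3, rng) for s in salaries]
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]  # local work
print(sum(partial_sums) % PRIME)  # 303000: the total, with no input revealed
```

Real protocols like those in FBPCF and EzPC add authenticated shares, multiplication via Beaver triples, and malicious-security checks on top of this same idea.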

Synthetic Data


  • Data Synthesizer - DataSynthesizer generates synthetic data that simulates a given dataset.
  • Faker - Faker is a Python package that generates fake data for you.
  • Pynonymizer - Pynonymizer is a universal tool for translating sensitive production database dumps into anonymized copies.
  • Synthetic Data Vault - The Synthetic Data Vault (SDV) enables end users to easily generate Synthetic Data for different data modalities, including single table, multi-table and time series data.
  • Synthetic Data Generation: Quality, Privacy, Bias (Workshop at ICLR 2021) - Workshop on the intersection of challenges regarding quality, privacy and bias in synthetic data generation.
  • synthpop - R package for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis.
  • Synthea - An open-source, synthetic patient generator that models the medical history of synthetic patients.
  • Presidio Evaluator - Data Generator - This data generator takes a text file with templates (e.g., `my name is [x]`) and creates a list of InputSamples which contain fake PII entities instead of placeholders.
  • Mimesis - Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.
  • plaitpy - plait.py is a program for generating fake data from composable yaml templates.
  • Bogus - Bogus is a simple fake data generator for .NET languages like C#, F# and VB.NET.
  • Gretel AI - Platform and APIs for generating synthetic data, with support for differential privacy.
  • Differentially Private Synthetic Data via Foundation Model APIs (DPSDA) - This repo is a Python library to generate differentially private synthetic data without the need of any ML model training.
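Most of the generators above fit a model to a real dataset and sample new records from it. A deliberately simple sketch of that idea, sampling each column independently from its marginal distribution (tools like SDV and synthpop also model cross-column structure, which this toy version intentionally drops; all names are illustrative):

```python
import random
from collections import Counter

def fit_marginals(rows):
    """Per-column value frequencies estimated from the real data."""
    return {col: Counter(r[col] for r in rows) for col in rows[0]}

def sample_synthetic(marginals, n, rng: random.Random):
    """Draw each column independently from its marginal distribution.
    This preserves per-column statistics but deliberately breaks
    cross-column correlations."""
    out = []
    for _ in range(n):
        row = {}
        for col, counts in marginals.items():
            values, weights = zip(*counts.items())
            row[col] = rng.choices(values, weights=weights)[0]
        out.append(row)
    return out

real = [
    {"city": "Oslo", "plan": "free"},
    {"city": "Oslo", "plan": "paid"},
    {"city": "Lima", "plan": "free"},
]
synthetic = sample_synthetic(fit_marginals(real), 5, random.Random(1))
print(len(synthetic))  # 5 rows drawn from the real data's marginals
```

Even this naive approach illustrates the central trade-off in synthetic data: the more faithfully the generator models the original joint distribution, the more utility, but also the more risk of leaking individual records.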

Differential Privacy and Federated Learning


Designing for Trust with Users


Deceptive Design Patterns


Tagging Personally Identifiable Information


  • Managing Tags in AWS Resource Groups - Tags are words or phrases that act as metadata that you can use to identify and organize your AWS resources. A resource can have up to 50 user-applied tags.
  • Categorizing Your AWS S3 Storage Using Tags - In addition to data classification, tagging offers benefits such as fine-grained access control of permissions and object lifecycle management.
  • Quickstart for Tagging Tables in Google Cloud - Tutorial shows how to create a BigQuery dataset, copy data to a new table in your dataset, create a tag template, and attach the tag to your table.
  • Using Policy Tags in Google Cloud's BigQuery - Use policy tags to define access to your data, for example, when you use BigQuery column-level security.
  • Adding a Tag-Based PII Policy in Cloudera - How to add a PII tag-based policy. In this example, the author creates a tag-based policy for objects tagged "PII" in Atlas.
  • BigQuery PII Classifier - Google Cloud BigQuery PII Classifier is a solution to automate the process of discovering and tagging PII data across BigQuery tables and applying column-level access controls to restrict specific PII data types to certain users/groups.
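The tag-based approaches above share one pattern: columns carry classification tags, and access is granted per tag rather than per column. A toy in-memory sketch of that pattern (the tag names, columns, and helper function are hypothetical, not any cloud provider's API):

```python
# Columns tagged with the classifications they carry; empty set = unrestricted.
COLUMN_TAGS = {
    "name": {"PII"},
    "email": {"PII"},
    "country": set(),
    "signup_date": set(),
}

def visible_columns(row: dict, user_tags: set) -> dict:
    """Return only the columns whose required tags the user holds."""
    return {c: v for c, v in row.items() if COLUMN_TAGS[c] <= user_tags}

row = {"name": "Ada", "email": "ada@example.com",
       "country": "UK", "signup_date": "2024-01-01"}
print(visible_columns(row, set()))    # non-PII columns only
print(visible_columns(row, {"PII"}))  # full row for authorized users
```

The advantage over per-column grants is that a new PII column only needs the right tag to inherit the existing access policy, which is exactly what solutions like the BigQuery PII Classifier automate.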

Regulatory and Framework Resources


Conferences


Career


Miscellaneous


Other Awesome Privacy Curations


Related GitHub Topics