---
title: "Introduction to privacy enhancing technologies (PET)"
teaching: 10
exercises: 2
---

:::::::::::::::::::::::::::::::::::::: questions

- TODO

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Understand PET, FL, MPC, homomorphic encryption, differential privacy
- Understand how different PET techniques relate
- Understand scenarios where PET could be applied
- Understand horizontal vs vertical partitioning
- Decompose a simple analysis in a federated way
- Understand that there is paperwork to be done (DPIA etc.)

::::::::::::::::::::::::::::::::::::::::::::::::

## Problem statement

The amount of data being generated nowadays is mind-boggling. This data can be a valuable resource for researchers. However, personal data should be handled with great care and responsibility because of its sensitive nature. This is why privacy regulations like the [GDPR](https://gdpr-info.eu/) are in place to restrict easy access to this wealth of data.

However, researchers are often not interested in the personal records that make up the data, but rather in the *insights* derived from them. This raises an intriguing question: can we unlock these valuable insights in a manner that upholds and respects privacy standards?

In classic data analysis, all data is copied over into a single place. This makes it very easy to use conventional data analysis software and tools to gain insights.

{alt='Two tables with data are moved to a central location'}

Unfortunately, this way of working does not respect the privacy of the people contained within the dataset: all their personal details end up at another party.

::::::::::::::::::::::::::::::::::::: challenge

## Other problems with copying data

Discuss in groups: what other issues do you see with handling the data by copying everything into one central place?

:::::::::::::::: solution

You might think of multiple issues. Some examples:

- The original data owner loses control of the data
- It results in multiple versions of the data
- What do you do when the data needs to be updated?
- If there was consent in the first place, how can consent be retracted?

:::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::::

## Data anonymization and pseudonymization

The first step in the process is often *data anonymization*: personally identifiable information is removed so that individuals stay anonymous. Data *pseudonymization* is a similar process, but in this case each record is assigned an id that makes it possible to link individuals across datasets.

{alt='Names are censored before the datasets are sent to a central place'}
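
As a toy illustration, pseudonymization can be as simple as replacing a direct identifier with a salted hash, so that the same person receives the same id in every dataset. This is a minimal sketch, not a recommended production scheme; the records, salt, and table names are made up, and a real project would manage the salt as a carefully guarded secret.

```python
import hashlib

# Toy records from two hypothetical data sources that share individuals.
hospital = [{"name": "Sara", "diagnosis": "asthma"}]
registry = [{"name": "Sara", "postcode": "1234AB"}]

SALT = b"example-salt"  # illustrative only; keep secret in practice

def pseudonym(name: str) -> str:
    """Replace a direct identifier with a stable pseudonymous id."""
    return hashlib.sha256(SALT + name.encode()).hexdigest()[:12]

for table in (hospital, registry):
    for record in table:
        record["id"] = pseudonym(record.pop("name"))

# The same person now has the same id in both tables, so records can
# still be linked, but the name itself is gone.
assert hospital[0]["id"] == registry[0]["id"]
```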

While data anonymization and pseudonymization are often a good first step, there is no guarantee that the data can never be reidentified. A famous example of reidentification is the story of the [Netflix Prize](https://en.wikipedia.org/wiki/Netflix_Prize), an open competition to build the best recommender system to predict user ratings for films based on previous ratings. The competition data was anonymized, but in 2007 two researchers from the University of Texas at Austin were able to identify a large number of users by matching the dataset with film ratings on the Internet Movie Database (IMDb).

## Federated data analysis

There are different ways in which privacy risks can be mitigated. We will focus on the idea of federated analysis. In a federated setting, the data stays with the data owner, who keeps full control over it. It is not the data that travels, but the analysis itself: the system sends a query or instruction to the data, and only the results are returned to the user. The results are usually some form of *aggregation* of the data. This can be a traditional statistic like the mean, or something more intricate like a machine learning model.
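
The idea of sending the analysis to the data can be sketched in a few lines. This is a hypothetical example with made-up party names and numbers: each party answers a query with an aggregate, and the researcher combines only those aggregates.

```python
# Each party keeps its own records and only answers an aggregate query.
party_a = [172, 168, 181]  # e.g. heights (cm) held by organization A
party_b = [159, 177]       # heights held by organization B

def local_query(data):
    """Runs at the data owner: returns only aggregates, never the records."""
    return sum(data), len(data)

# The researcher combines the partial aggregates into a global mean
# without ever seeing an individual record.
partials = [local_query(party_a), local_query(party_b)]
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
print(total / count)  # 171.4
```

Note that a federated mean requires each party to return both a sum and a count; averaging the local means directly would weight small and large datasets equally.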

Aggregating the data does not ensure complete protection of person-level information, but it certainly makes leakage less likely.

TODO: Example of data leakage in simple aggregated case

## Federated learning

The term federated learning was introduced in 2016 by researchers at Google [(McMahan et al.)](https://doi.org/10.48550/arXiv.1602.05629) and refers to a "loose federation of participating devices (which we refer to as clients) which are coordinated by a central server." In traditional federated learning, the clients train machine learning models, and only the updates of the models are sent back to the central server. The central server combines the updates from all the individual clients into one final machine learning model.
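
The combination step at the central server can be sketched as follows. This is a bare-bones illustration of the federated averaging idea with made-up model parameters and client sizes; a real system would also handle the local training, communication, and security.

```python
# Each client trains locally and sends only its model parameters (a plain
# list of numbers here) plus how many samples it trained on -- never the
# raw data itself.
client_updates = [
    {"weights": [0.2, 1.0], "n_samples": 100},  # made-up client A
    {"weights": [0.4, 0.6], "n_samples": 300},  # made-up client B
]

def federated_average(updates):
    """Combine client models into one, weighting by local dataset size."""
    total = sum(u["n_samples"] for u in updates)
    n_params = len(updates[0]["weights"])
    return [
        sum(u["weights"][i] * u["n_samples"] for u in updates) / total
        for i in range(n_params)
    ]

print(federated_average(client_updates))  # [0.35, 0.7]
```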

There are caveats to using this type of data analysis, though. Although the data transmitted from the clients to the server is an aggregation of the raw data, researchers have found ways to use these updates to reconstruct the original data. This vulnerability is called *gradient leakage*.

## Secure Multiparty Computation

There are different solutions to prevent the reconstruction of raw data. One solution is to make sure that no party other than the data owner is actually able to see the intermediate data. One branch of techniques that can be used for this is Secure Multiparty Computation (MPC). With MPC, computations are performed collaboratively by multiple parties. Data is encrypted in such a way that other parties cannot see the original values, but values of multiple parties can still be combined (e.g. added or multiplied).

A classic technique from the field of MPC is secret sharing. With this technique data is encrypted, after which the pieces, called shares, are sent to the other parties. No single party is able to reconstruct the original value; only when all parties work together can the original value be retrieved.

When multiple values are combined using secret sharing, the parties end up owning new puzzle pieces that, when put together, reveal the result of the computation.

::: callout

### Secret sharing, an example

Mees, Sara and Noor want to know how much they weigh in total. Mees weighs 43 kg, Sara 39 kg, and Noor 45 kg. They create secret shares for their weights that they give to their peers.

|                 | Mees receives | Sara receives | Noor receives | Sum |
|-----------------|---------------|---------------|---------------|-----|
| Mees generates: | -11           | 50            | 4             | 43  |
| Sara generates: | -12           | 17            | 34            | 39  |
| Noor generates: | 19            | -38           | 64            | 45  |

They each sum the shares they received:

| Party | Sum of received shares |
|-------|------------------------|
| Mees  | -4                     |
| Sara  | 29                     |
| Noor  | 102                    |

They then add these sums together: -4 + 29 + 102 = 127. In this way, they have aggregated their data without sharing their individual data with anyone else.
:::
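
The example above can be sketched in code. This is a toy version of additive secret sharing; real implementations draw random shares modulo a large number rather than small integers, but the principle is the same.

```python
import random

def make_shares(secret: int, n_parties: int) -> list:
    """Split a secret into additive shares that sum back to the secret."""
    shares = [random.randint(-100, 100) for _ in range(n_parties - 1)]
    shares.append(secret - sum(shares))  # last share makes the total exact
    return shares

weights = {"Mees": 43, "Sara": 39, "Noor": 45}

# Everyone splits their weight into one share per participant.
all_shares = {name: make_shares(w, 3) for name, w in weights.items()}

# Party i sums the i-th share it received from everyone.
partial_sums = [sum(shares[i] for shares in all_shares.values())
                for i in range(3)]

# Publishing and adding the partial sums reveals only the total weight;
# no single share says anything about an individual's weight.
print(sum(partial_sums))  # 127
```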

TODO: Exercise with secret sharing where data is leaked.

## Differential privacy

As mentioned before, aggregation of data does not always prevent leaks of sensitive information. Consider the example of Mees, Sara and Noor: we know their total weight is 127 kg. If Sara and Noor get together and subtract their own weights from the total, they can infer how much Mees weighs.

An aggregation is differentially private when someone cannot infer whether a particular individual was used in the computation. One way to make a result more differentially private is to replace a selection of the inputs with random noise. Any single individual can then plausibly deny that their data contributed to the final result: an individual has *plausible deniability* with regard to whether they were part of the dataset.
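
Another standard mechanism, shown here as an illustration rather than as the lesson's prescribed method, is to add random noise to the aggregate before publishing it, calibrated to how much any single person could change the result. The data and parameter values below are made up.

```python
import random

weights = [43, 39, 45]  # toy data from the example above
sensitivity = 150       # assumed upper bound on any one person's weight (kg)
epsilon = 1.0           # privacy budget: smaller means more noise, more privacy

def noisy_sum(values):
    """Report the sum plus Laplace noise so no single contribution is certain."""
    scale = sensitivity / epsilon
    # The difference of two exponential draws follows a Laplace distribution.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return sum(values) + noise

print(noisy_sum(weights))  # the true sum is 127; the reported value is randomized
```

Even if Sara and Noor subtract their own weights from the published value, they now obtain only a noisy estimate of Mees's weight, not the exact number.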

## Blocks upon blocks

The previously mentioned techniques are not used in isolation, but are usually stacked on top of each other to mitigate the privacy risks that are relevant within a certain use case. Typically, the process begins by anonymizing or pseudonymizing the data. With vantage6, the data is then placed in a federated setting. You can use the existing algorithms available for vantage6, which often incorporate various privacy-enhancing techniques.

## Data partitioning

Data sharing challenges come in many different shapes and sizes, but in the end, the goal of the researchers is often to analyze data *as if* it were available in one big table in one place. There are two main ways in which such a dataset can be separated over different sources: **horizontal** and **vertical** partitioning. In horizontal partitioning, this giant table has been snipped into pieces by making horizontal cuts: all information about an individual record stays in one place, but the records themselves are scattered across different locations.

In vertical partitioning, the cuts have been made vertically, so the columns are divided over different locations. This type of partitioning is usually more challenging, because a way often needs to be found to link identities across data sources. Vertical partitioning also requires different types of privacy enhancing algorithms than horizontal partitioning.

In reality, data can be horizontally and vertically partitioned at the same time, and it might be necessary to combine multiple techniques to overcome these challenges.
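
The two kinds of partitioning can be illustrated with plain Python structures; the patient table below is made up, and no external libraries are assumed.

```python
# The "one big table" researchers would like to analyze, as a list of records.
table = [
    {"id": 1, "age": 34, "smoker": True},
    {"id": 2, "age": 51, "smoker": False},
    {"id": 3, "age": 47, "smoker": True},
]

# Horizontal partitioning: each site holds complete records for some people.
site_a = table[:2]  # records 1 and 2
site_b = table[2:]  # record 3

# Vertical partitioning: each site holds some columns for all people,
# plus an identifier needed to link records across sites.
site_c = [{"id": r["id"], "age": r["age"]} for r in table]
site_d = [{"id": r["id"], "smoker": r["smoker"]} for r in table]

# Analyzing vertically partitioned data requires matching identities:
linked = [{**c, **d} for c in site_c for d in site_d if c["id"] == d["id"]]
assert linked == table
```

The final line shows why vertical partitioning is harder in practice: reconstructing the joint view depends on a shared identifier, which is exactly the kind of linkage that pseudonymization must enable without exposing names.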

## Technology doesn't solve everything

You have now learned about various technologies for analyzing data while preserving the privacy of individuals. However, it should be emphasized that these technologies do not solve all your data sharing problems; rather, they are only a small piece of the puzzle. In research projects involving privacy enhancing technologies, a lot of work goes into complying with regulations and building trust.

Since these projects risk affecting the privacy of individuals, a Data Protection Impact Assessment (DPIA) is usually required. This is a process that helps to identify and minimize the privacy risks of a project, and it is required by the GDPR.

Apart from the procedures required by the GDPR, there might be other regulations in place enforced by the owners of the data (e.g. hospitals). The specific situation of a project can affect the way in which the data is allowed to be processed: some privacy enhancing technologies might be allowed in one project but prohibited in another. It is always important to be transparent about the privacy risks of the technologies you intend to use.

::::::::::::::::::::::::::::::::::::: keypoints

- TODO

::::::::::::::::::::::::::::::::::::::::::::::::