
Commit 386750a

markdown source builds
Auto-generated via {sandpaper}

Source  : b5bc759
Branch  : main
Author  : Djura <djura.smits@gmail.com>
Time    : 2024-06-04 09:03:12 +0000
Message : Merge pull request #16 from vantage6/3-feature-request-draft-episode-1-basic-concepts-of-pet

Basic concepts of pet
1 parent 0d7fbd7 commit 386750a

File tree

5 files changed (+183, -76 lines)


fig/classic analysis.jpg

62.9 KB

fig/data_anonymization.jpg

57.9 KB

fig/gradient_leakage.jpg

143 KB

introduction.md

Lines changed: 182 additions & 75 deletions
@@ -1,113 +1,220 @@
11
---
2-
title: "Using Markdown"
2+
title: "Introduction to privacy enhancing technologies (PET)"
33
teaching: 10
44
exercises: 2
55
---
66

7-
:::::::::::::::::::::::::::::::::::::: questions
7+
:::::::::::::::::::::::::::::::::::::: questions
88

9-
- How do you write a lesson using Markdown and `{sandpaper}`?
9+
- TODO
1010

1111
::::::::::::::::::::::::::::::::::::::::::::::::
1212

1313
::::::::::::::::::::::::::::::::::::: objectives
1414

15-
- Explain how to use markdown with The Carpentries Workbench
16-
- Demonstrate how to include pieces of code, figures, and nested challenge blocks
17-
15+
- Understand PET, federated learning (FL), secure multiparty computation (MPC), homomorphic encryption, and differential privacy
16+
- Understand how different PET techniques relate
17+
- Understand scenarios where PET could be applied
18+
- Understand horizontal vs vertical partitioning
19+
- Decompose a simple analysis in a federated way
20+
- Understand that there is paperwork to be done (DPIA etc.)
1821
::::::::::::::::::::::::::::::::::::::::::::::::
1922

20-
## Introduction
21-
22-
This is a lesson created via The Carpentries Workbench. It is written in
23-
[Pandoc-flavored Markdown](https://pandoc.org/MANUAL.txt) for static files and
24-
[R Markdown][r-markdown] for dynamic files that can render code into output.
25-
Please refer to the [Introduction to The Carpentries
26-
Workbench](https://carpentries.github.io/sandpaper-docs/) for full documentation.
27-
28-
What you need to know is that there are three sections required for a valid
29-
Carpentries lesson:
30-
31-
1. `questions` are displayed at the beginning of the episode to prime the
32-
learner for the content.
33-
2. `objectives` are the learning objectives for an episode displayed with
34-
the questions.
35-
3. `keypoints` are displayed at the end of the episode to reinforce the
36-
objectives.
37-
38-
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor
39-
40-
Inline instructor notes can help inform instructors of timing challenges
41-
associated with the lessons. They appear in the "Instructor View"
42-
43-
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
23+
## Problem statement
4424

45-
::::::::::::::::::::::::::::::::::::: challenge
25+
The amount of data being generated nowadays is absolutely mind-boggling. This data can be a valuable
26+
resource for researchers. However, personal data should be handled with great care and
27+
responsibility
28+
because of its sensitive nature. This is why there are privacy regulations in place like
29+
[GDPR](https://gdpr-info.eu/) that restrict access to this wealth of data.
4630

47-
## Challenge 1: Can you do it?
31+
However, often researchers are not interested in the personal records that make up the data, but
32+
rather
33+
in the *insights* derived from it. This raises an intriguing question: Can we unlock these valuable
34+
insights in a manner that upholds and respects privacy standards?
4835

49-
What is the output of this command?
36+
In classic data analysis, all data is copied over into a single place. This makes it very easy to
37+
use
38+
conventional data analysis software and tools to gain insights.
5039

51-
```r
52-
paste("This", "new", "lesson", "looks", "good")
53-
```
40+
![In classic analysis all the data is brought together](fig/classic%20analysis.jpg){alt='Two tables
41+
with data are moved to a central location'}
5442

55-
:::::::::::::::::::::::: solution
43+
Unfortunately this way of working does not respect the privacy of the people contained within the
44+
dataset. All their personal details end up at another party.
5645

57-
## Output
58-
59-
```output
60-
[1] "This new lesson looks good"
61-
```
46+
::::::::::::::::::::::::::::::::::::: challenge
6247

63-
:::::::::::::::::::::::::::::::::
48+
## Other problems with copying data
6449

50+
Discuss in groups what other issues you see with handling the data by copying everything into one
51+
central place.
6552

66-
## Challenge 2: how do you nest solutions within challenge blocks?
53+
:::::::::::::::: solution
6754

68-
:::::::::::::::::::::::: solution
55+
You might think of multiple issues. Some examples:
6956

70-
You can add a line with at least three colons and a `solution` tag.
57+
- The original data owner loses control of the data
58+
- Results in multiple versions of the data
59+
- What to do when the data needs to be updated?
60+
- If there was consent in the first place, how can you retract consent?
7161

72-
:::::::::::::::::::::::::::::::::
73-
::::::::::::::::::::::::::::::::::::::::::::::::
74-
75-
## Figures
76-
77-
You can use standard markdown for static figures with the following syntax:
78-
79-
`![optional caption that appears below the figure](figure url){alt='alt text for
80-
accessibility purposes'}`
62+
:::::::::::::::::::::::::
63+
:::::::::::::::::::::::::::::::::::::::::::::::
8164

82-
![You belong in The Carpentries!](https://raw.githubusercontent.com/carpentries/logo/master/Badge_Carpentries.svg){alt='Blue Carpentries hex person logo with no text.'}
65+
## Data anonymization and pseudonymization
8366

84-
::::::::::::::::::::::::::::::::::::: callout
67+
The first step in the process is often *data anonymization*. Personally identifiable information
68+
will in this case be removed so that individuals stay anonymous. Data *pseudonymization* is a
69+
similar process, but in this case, the records will be assigned an id that will make it
70+
possible to link individuals across datasets.
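As a minimal sketch of pseudonymization in plain Python (the dataset, field names, and secret key below are made up): direct identifiers are replaced with a keyed hash, so the same person gets the same ID in every dataset and records can still be linked without the name travelling along.

```python
import hmac
import hashlib

# Hypothetical secret key, known only to the party doing the pseudonymization.
SECRET_KEY = b"do-not-share-this-key"

def pseudonymize(name: str) -> str:
    """Replace a direct identifier with a stable, hard-to-reverse pseudonym."""
    return hmac.new(SECRET_KEY, name.lower().encode(), hashlib.sha256).hexdigest()[:12]

hospital = [{"name": "Sara Jansen", "blood_pressure": 128}]
registry = [{"name": "Sara Jansen", "smoker": False}]

# The name is dropped; only the pseudonym leaves each source, yet the records remain linkable.
hospital_out = [{"id": pseudonymize(r["name"]), "blood_pressure": r["blood_pressure"]} for r in hospital]
registry_out = [{"id": pseudonymize(r["name"]), "smoker": r["smoker"]} for r in registry]

print(hospital_out[0]["id"] == registry_out[0]["id"])  # True: same person, same pseudonym
```

As the Netflix example below shows, removing or hashing direct identifiers does not by itself rule out re-identification.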
8571

86-
Callout sections can highlight information.
87-
88-
They are sometimes used to emphasise particularly important points
89-
but are also used in some lessons to present "asides":
90-
content that is not central to the narrative of the lesson,
91-
e.g. by providing the answer to a commonly-asked question.
92-
93-
::::::::::::::::::::::::::::::::::::::::::::::::
72+
![Data anonymization](fig/data_anonymization.jpg){alt='Names are censored before the datasets
73+
are sent to a central place'}
9474

75+
While data anonymization and pseudonymization are often a good first step, there is no guarantee
76+
that the data will never be reidentified. A famous example of reidentification is the story of the
77+
[Netflix prize](https://en.wikipedia.org/wiki/Netflix_Prize). The Netflix prize was an open
78+
competition to build the best recommender system to predict user ratings for films based on previous
79+
ratings. The data was anonymized, but in 2007 two researchers from The University of Texas at Austin
80+
were able to identify a large number of users by matching the dataset with film ratings on the
81+
Internet Movie Database (IMDB).
9582

96-
## Math
83+
## Federated data analysis
9784

98-
One of our episodes contains $\LaTeX$ equations when describing how to create
99-
dynamic reports with {knitr}, so we now use mathjax to describe this:
85+
There are different ways in which privacy risks can be mitigated. We will focus on the idea of
86+
federated analysis. In a federated setting, the data stays with the data owner, who keeps full control
87+
over it. In this case, it is not the data that travels, but the analysis itself. The system sends
88+
a query or instruction to the data, and only the results are sent back to the user.
89+
The results are often akin to a form of *aggregation* of the data. This can be in the shape of
90+
traditional
91+
statistics like the mean, or it can be more intricate like a machine learning model.
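As a minimal sketch of how a simple analysis can be decomposed in a federated way (the site data and function names below are made up, and this is plain Python rather than any particular federated framework): each data owner returns only a partial aggregate, and the central party combines the partials.

```python
# Runs at each data owner: only aggregates leave the site, never the records.
def local_partial_mean(ages):
    return {"sum": sum(ages), "count": len(ages)}

# Runs at the central party, which only ever sees the partial results.
def combine_partial_means(partials):
    total = sum(p["sum"] for p in partials)
    n = sum(p["count"] for p in partials)
    return total / n

site_a = [34, 51, 47]       # stays at hospital A
site_b = [62, 58, 45, 70]   # stays at hospital B

partials = [local_partial_mean(site_a), local_partial_mean(site_b)]
print(combine_partial_means(partials))  # identical to the mean over the pooled data
```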
10092

101-
`$\alpha = \dfrac{1}{(1 - \beta)^2}$` becomes: $\alpha = \dfrac{1}{(1 - \beta)^2}$
93+
Aggregating the data does not ensure complete protection of person-level information, but it
94+
certainly makes such leakage less likely.
10295

103-
Cool, right?
96+
TODO: Example of data leakage in simple aggregated case
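Purely as an illustration of how even simple aggregates can leak person-level information (the numbers below are made up): when two published totals differ by exactly one person, that person's value falls out directly.

```python
# Two published aggregates over almost the same group of employees.
total_salary_all_9_employees = 418_000
total_salary_without_the_cfo = 318_000

# Anyone who sees both "anonymous" totals learns the CFO's exact salary.
cfo_salary = total_salary_all_9_employees - total_salary_without_the_cfo
print(cfo_salary)  # 100000
```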
10497

105-
::::::::::::::::::::::::::::::::::::: keypoints
98+
## Federated learning
10699

107-
- Use `.md` files for episodes when you want static content
108-
- Use `.Rmd` files for episodes when you need to generate output
109-
- Run `sandpaper::check_lesson()` to identify any issues with your lesson
110-
- Run `sandpaper::build_lesson()` to preview your lesson locally
100+
The term federated learning was introduced in 2016 by researchers at Google
101+
[(McMahan et al.)](https://doi.org/10.48550/arXiv.1602.05629) and refers to a "loose federation of
102+
participating devices (which we refer to as clients) which are coordinated by a central server." In
103+
traditional federated learning, the clients train machine learning models, and only the updates of
104+
the models are sent back to the central server. The central server combines the updates from all the
105+
individual clients into one final machine learning model.
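As a minimal sketch of this idea (a toy one-parameter linear model in plain Python, not the full algorithm from the paper and not any particular framework's API; the data and learning rate are made up): each client trains on its own data and only sends its updated weight to the server, which averages the weights into the new global model.

```python
# Toy model y ≈ w * x, trained with federated averaging.
def local_update(w, data, lr=0.01):
    """One step of gradient descent on the client's own data; only the new weight leaves the client."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

clients = [
    [(1.0, 2.1), (2.0, 3.9)],  # data that never leaves client A
    [(3.0, 6.2), (4.0, 7.8)],  # data that never leaves client B
]

w = 0.0  # global model weight held by the central server
for _ in range(50):
    client_weights = [local_update(w, data) for data in clients]  # local training
    w = sum(client_weights) / len(client_weights)                 # server averages the updates

print(round(w, 2))  # close to 2.0, without the server ever seeing the raw (x, y) pairs
```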
106+
107+
There are caveats to using this type of data analysis though. Although the data transmitted from the
108+
clients to the server are an aggregation of the raw data, researchers have found a way to use this
109+
data to reconstruct the original data. This vulnerability is called *gradient* leakage.
110+
111+
![An example of gradient leakage](fig/gradient_leakage.jpg)
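To see why gradients can leak data at all, consider a deliberately simplified case (made-up numbers, a single training record, and a plain linear model rather than the networks used in real attacks): the gradient with respect to the weights is just the private input scaled by the prediction error, so the server can recover the input exactly.

```python
# Client side: one private training record (x, t) for a linear model y = w·x + b.
x = [2.0, 5.0, 7.0]            # private input features
t = 1.0                        # private label
w, b = [0.1, -0.2, 0.3], 0.0   # current global model, also known to the server

error = sum(wi * xi for wi, xi in zip(w, x)) + b - t  # prediction error
grad_w = [error * xi for xi in x]                     # gradient w.r.t. the weights
grad_b = error                                        # gradient w.r.t. the bias

# Server side: the transmitted update still reveals the raw features exactly.
reconstructed_x = [round(gw / grad_b, 6) for gw in grad_w]
print(reconstructed_x)  # [2.0, 5.0, 7.0]
```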
112+
113+
## Secure Multiparty Computation
114+
115+
There are different solutions to prevent the reconstruction of raw data. One solution is to make
116+
sure that no party other than the data owner is actually able to see the intermediate data. One
117+
branch
118+
of techniques that can be used for this is Secure Multiparty Computation (MPC). With MPC,
119+
computations are performed collaboratively by multiple parties. Data is encrypted in such a way that
120+
other parties cannot see the original values, but values of multiple parties can still be combined
121+
(e.g. added or
122+
multiplied).
123+
A classic technique from the field of MPC is secret sharing. With this technique, data is encrypted,
124+
after which the encrypted pieces (shares) are sent to the other parties. No single party will be able to
125+
reconstruct the original value. Only when all parties work together can the original value be
126+
retrieved.
127+
128+
When multiple values are combined using secret sharing, the parties end up owning new
129+
puzzle pieces that, when put together, reveal the result of the computation.
130+
131+
::: callout
132+
133+
### Secret sharing, an example
134+
135+
Mees, Sara and Noor want to know how much they weigh in total.
136+
Mees weighs 43 kg, Sara weighs 39, Noor weighs 45.
137+
They create secret shares for their weights that they give to their peers.
138+
139+
| | Mees receives | Sara receives | Noor receives | Sum |
140+
|-----------------|---------------|---------------|---------------|-----|
141+
| Mees generates: | -11 | 50 | 4 | 43 |
142+
| Sara generates: | -12 | 17 | 34 | 39 |
143+
| Noor generates: | 19            | -38           | 64            | 45  |
144+
145+
They sum their shares:
146+
147+
| | |
148+
|------|-----|
149+
| Mees | -4 |
150+
| Sara | 29 |
151+
| Noor | 102 |
152+
153+
They add their sums together: -4 + 29 + 102 = 127
154+
In this way, they have aggregated their data without sharing their individual data with anyone else.
155+
:::
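A minimal sketch of the bookkeeping in the example above, in plain Python (real MPC frameworks work over finite fields and add many safeguards, so treat this only as an illustration):

```python
import random

def make_shares(secret, n_parties):
    """Split a secret into random-looking shares that sum back to the secret."""
    shares = [random.randint(-100, 100) for _ in range(n_parties - 1)]
    shares.append(secret - sum(shares))  # the last share makes the total come out right
    return shares

weights = {"Mees": 43, "Sara": 39, "Noor": 45}

# Each person splits their weight into three shares and hands one share to each peer.
shares = {name: make_shares(weight, 3) for name, weight in weights.items()}

# Party i only ever sees the i-th share from every person and publishes the sum of those shares.
partial_sums = [sum(shares[name][i] for name in weights) for i in range(3)]

print(sum(partial_sums))  # 127: the total weight, while nobody saw anyone else's weight
```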
156+
TODO: Exercise with secret sharing where data is leaked.
157+
158+
## Differential privacy
159+
160+
As mentioned before, aggregation of data will not always prevent leaks of sensitive information.
161+
Consider the example of Mees, Sara and Noor. We know their total weight is 127 kg. If Sara and Noor
162+
get together and subtract their weights from the total, they will be able to infer how much Mees
163+
weighs.
164+
165+
An aggregation is differentially private when someone cannot infer whether a particular individual
166+
was used in the computation. A way to make a result more differentially private is to replace a
167+
selection of inputs with random noise. A single individual will then always be able to deny that
168+
their data has contributed to the final result. An individual has *plausible deniability* with
169+
regard to whether their data was
170+
part of the dataset.
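One classic way to obtain this kind of plausible deniability is *randomized response*, sketched below for a sensitive yes/no question (the question, probabilities, and sample size are made up; real deployments choose the noise level to match a formal privacy budget):

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Report the truth half of the time, otherwise report a coin flip."""
    if random.random() < 0.5:
        return true_answer
    return random.random() < 0.5

# 30% of people truthfully answer "yes" to the sensitive question.
true_answers = [random.random() < 0.3 for _ in range(100_000)]
reported = [randomized_response(a) for a in true_answers]

# Every individual can deny their reported answer, yet the overall rate is still recoverable:
observed = sum(reported) / len(reported)        # ≈ 0.5 * true_rate + 0.25
estimated_true_rate = (observed - 0.25) / 0.5
print(round(estimated_true_rate, 2))            # ≈ 0.3
```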
171+
172+
## Blocks upon blocks
173+
The previously mentioned techniques are not used in isolation, but are usually stacked on top of
174+
each other to mitigate the privacy risks that are relevant within a certain use case.
175+
Typically, the process begins by anonymizing or pseudonymizing the data. With vantage6, the data is
176+
then placed in a federated setting. You can use the existing algorithms available for vantage6,
177+
which often incorporate various privacy-enhancing techniques.
178+
179+
## Data partitioning
180+
Data sharing challenges come in many different shapes and sizes, but in the end, the goal of the
181+
researchers is often to analyze data *as if* it were available in one big table in one place.
182+
There are two main ways in which the dataset can be separated over different sources: **horizontal**
183+
and **vertical** partitioning. In horizontal partitioning, this giant table has been snipped into pieces
184+
by making horizontal cuts. The result is that all information about an individual record stays in one
185+
place, but the records themselves are scattered across different locations.
186+
187+
In vertical partitioning, the cuts have been made vertically. Columns have now been divided over
188+
different locations. This type of partitioning is usually more challenging because often a way needs
189+
to be found to link identities across data sources. Vertical partitioning requires different types
190+
of privacy enhancing algorithms than horizontal partitioning.
191+
192+
In reality, data can be horizontally and vertically partitioned at the same time. It might be
193+
necessary to combine multiple techniques to overcome these challenges.
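A minimal sketch of the two pure cases, using a small made-up table in plain Python (the column names and values are hypothetical):

```python
# The "one big table" researchers would like to analyze.
full_table = [
    {"id": 1, "age": 34, "smoker": True,  "income": 41_000},
    {"id": 2, "age": 51, "smoker": False, "income": 58_000},
    {"id": 3, "age": 47, "smoker": False, "income": 52_000},
]

# Horizontal partitioning: each site holds complete records, but for different people.
hospital_a = [row for row in full_table if row["id"] in (1, 2)]
hospital_b = [row for row in full_table if row["id"] == 3]

# Vertical partitioning: each site holds all people, but different columns,
# so the sites must first agree on how to link the record ids.
hospital   = [{"id": r["id"], "age": r["age"], "smoker": r["smoker"]} for r in full_table]
tax_office = [{"id": r["id"], "income": r["income"]} for r in full_table]
```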
194+
195+
## Technology doesn't solve everything
196+
197+
You have now learned about various technologies for analyzing data while preserving privacy of
198+
individuals. However, it should be emphasized that these technologies do not solve all your data
199+
sharing problems. Rather, they are only a small piece of the puzzle. In research projects involving
200+
privacy enhancing technologies, a lot of work goes into complying with regulations and building
201+
trust.
202+
203+
Since these projects have a risk of affecting the privacy of individuals, a Data Protection Impact
204+
Assessment (DPIA)
205+
is usually required. This is a process that helps identify and minimize the privacy risks of a
206+
project
207+
and is required by GDPR.
208+
209+
Apart from procedures required by GDPR there might be other regulations in place enforced by the
210+
owners of the data (e.g. hospitals). The specific situation of a project can affect the way in which
211+
the data is allowed to be processed. Some privacy enhancing technologies might be allowed in one
212+
project but prohibited in another. It is always important to stay transparent about privacy risks
213+
of the technologies you intend to use.
214+
215+
::::::::::::::::::::::::::::::::::::: keypoints
216+
217+
- TODO
111218

112219
::::::::::::::::::::::::::::::::::::::::::::::::
113220

md5sum.txt

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
55
"index.md" "a02c9c785ed98ddd84fe3d34ddb12fcd" "site/built/index.md" "2022-04-22"
66
"links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2022-04-22"
77
"notes.md" "7f1c9fbd8d8ae2649784ae52350da772" "site/built/notes.md" "2024-03-27"
8-
"episodes/introduction.md" "6c55d31b41d322729fb3276f8d4371fc" "site/built/introduction.md" "2023-07-24"
8+
"episodes/introduction.md" "08c3b2fcca2feb44b59a2679a796239c" "site/built/introduction.md" "2024-06-04"
99
"instructors/instructor-notes.md" "cae72b6712578d74a49fea7513099f8c" "site/built/instructor-notes.md" "2023-03-16"
1010
"learners/reference.md" "1c7cc4e229304d9806a13f69ca1b8ba4" "site/built/reference.md" "2023-03-16"
1111
"learners/setup.md" "5456593e4a75491955ac4a252c05fbc9" "site/built/setup.md" "2024-01-26"
