
Commit 386750a

markdown source builds
Auto-generated via {sandpaper}

Source  : b5bc759
Branch  : main
Author  : Djura <djura.smits@gmail.com>
Time    : 2024-06-04 09:03:12 +0000
Message : Merge pull request #16 from vantage6/3-feature-request-draft-episode-1-basic-concepts-of-pet

Basic concepts of pet
1 parent 0d7fbd7 commit 386750a

File tree

5 files changed (+183, -76 lines)


fig/classic analysis.jpg

62.9 KB

fig/data_anonymization.jpg

57.9 KB

fig/gradient_leakage.jpg

143 KB

introduction.md

Lines changed: 182 additions & 75 deletions
@@ -1,113 +1,220 @@
11
---
2-
title: "Using Markdown"
2+
title: "Introduction to privacy enhancing technologies (PET)"
33
teaching: 10
44
exercises: 2
55
---
66

7-
:::::::::::::::::::::::::::::::::::::: questions
7+
:::::::::::::::::::::::::::::::::::::: questions
88

9-
- How do you write a lesson using Markdown and `{sandpaper}`?
9+
- TODO
1010

1111
::::::::::::::::::::::::::::::::::::::::::::::::
1212

1313
::::::::::::::::::::::::::::::::::::: objectives
1414

15-
- Explain how to use markdown with The Carpentries Workbench
16-
- Demonstrate how to include pieces of code, figures, and nested challenge blocks
17-
15+
- Understand PET, federated learning (FL), secure multiparty computation (MPC), homomorphic encryption, and differential privacy
16+
- Understand how different PET techniques relate
17+
- Understand scenarios where PET could be applied
18+
- Understand horizontal vs vertical partitioning
19+
- Decompose a simple analysis in a federated way
20+
- Understand that there is paperwork to be done (DPIA etc.)
1821
::::::::::::::::::::::::::::::::::::::::::::::::
1922

20-
## Introduction
21-
22-
This is a lesson created via The Carpentries Workbench. It is written in
23-
[Pandoc-flavored Markdown](https://pandoc.org/MANUAL.txt) for static files and
24-
[R Markdown][r-markdown] for dynamic files that can render code into output.
25-
Please refer to the [Introduction to The Carpentries
26-
Workbench](https://carpentries.github.io/sandpaper-docs/) for full documentation.
27-
28-
What you need to know is that there are three sections required for a valid
29-
Carpentries lesson:
30-
31-
1. `questions` are displayed at the beginning of the episode to prime the
32-
learner for the content.
33-
2. `objectives` are the learning objectives for an episode displayed with
34-
the questions.
35-
3. `keypoints` are displayed at the end of the episode to reinforce the
36-
objectives.
37-
38-
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor
39-
40-
Inline instructor notes can help inform instructors of timing challenges
41-
associated with the lessons. They appear in the "Instructor View"
42-
43-
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
23+
## Problem statement
4424

45-
::::::::::::::::::::::::::::::::::::: challenge
25+
The amount of data being generated nowadays is absolutely mind-boggling. This data can be a valuable
26+
resource for researchers. However, personal data should be handled with great care and
27+
responsibility
28+
because of its sensitive nature. This is why there are privacy regulations in place like
29+
[GDPR](https://gdpr-info.eu/) that restrict access to this wealth of data.
4630

47-
## Challenge 1: Can you do it?
31+
However, often researchers are not interested in the personal records that make up the data, but
32+
rather
33+
in the *insights* derived from it. This raises an intriguing question: Can we unlock these valuable
34+
insights in a manner that upholds and respects privacy standards?
4835

49-
What is the output of this command?
36+
In classic data analysis, all data is copied over into a single place. This makes it very easy to
37+
use
38+
conventional data analysis software and tools to gain insights.
5039

51-
```r
52-
paste("This", "new", "lesson", "looks", "good")
53-
```
40+
![In classic analysis all the data is brought together](fig/classic%20analysis.jpg){alt='Two tables
41+
with data are moved to a central location'}
5442

55-
:::::::::::::::::::::::: solution
43+
Unfortunately this way of working does not respect the privacy of the people contained within the
44+
dataset. All their personal details end up at another party.
5645

57-
## Output
58-
59-
```output
60-
[1] "This new lesson looks good"
61-
```
46+
::::::::::::::::::::::::::::::::::::: challenge
6247

63-
:::::::::::::::::::::::::::::::::
48+
## Other problems with copying data
6449

50+
Discuss in groups what other issues you see with handling the data by copying everything into one
51+
central place.
6552

66-
## Challenge 2: how do you nest solutions within challenge blocks?
53+
:::::::::::::::: solution
6754

68-
:::::::::::::::::::::::: solution
55+
You might think of multiple issues. Some examples:
6956

70-
You can add a line with at least three colons and a `solution` tag.
57+
- The original data owner loses control of the data
58+
- Results in multiple versions of the data
59+
- What to do when the data needs to be updated?
60+
- If there was consent in the first place, how can you retract consent?
7161

72-
:::::::::::::::::::::::::::::::::
73-
::::::::::::::::::::::::::::::::::::::::::::::::
74-
75-
## Figures
76-
77-
You can use standard markdown for static figures with the following syntax:
78-
79-
`![optional caption that appears below the figure](figure url){alt='alt text for
80-
accessibility purposes'}`
62+
:::::::::::::::::::::::::
63+
:::::::::::::::::::::::::::::::::::::::::::::::
8164

82-
![You belong in The Carpentries!](https://raw.githubusercontent.com/carpentries/logo/master/Badge_Carpentries.svg){alt='Blue Carpentries hex person logo with no text.'}
65+
## Data anonymization and pseudonymization
8366

84-
::::::::::::::::::::::::::::::::::::: callout
67+
The first step in the process is often *data anonymization*. Personally identifiable information
68+
will in this case be removed so that individuals stay anonymous. Data *pseudonymization* is a
69+
similar process, but in this case, the records will be assigned an id that will make it
70+
possible to link individuals across datasets.
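As a minimal sketch of pseudonymization in plain Python (the dataset, field names, and secret key below are made up): direct identifiers are replaced with a keyed hash, so the same person gets the same ID in every dataset and records can still be linked without the name travelling along.

```python
import hmac
import hashlib

# Hypothetical secret key, known only to the party doing the pseudonymization.
SECRET_KEY = b"do-not-share-this-key"

def pseudonymize(name: str) -> str:
    """Replace a direct identifier with a stable, hard-to-reverse pseudonym."""
    return hmac.new(SECRET_KEY, name.lower().encode(), hashlib.sha256).hexdigest()[:12]

hospital = [{"name": "Sara Jansen", "blood_pressure": 128}]
registry = [{"name": "Sara Jansen", "smoker": False}]

# The name is dropped; only the pseudonym leaves each source, yet the records remain linkable.
hospital_out = [{"id": pseudonymize(r["name"]), "blood_pressure": r["blood_pressure"]} for r in hospital]
registry_out = [{"id": pseudonymize(r["name"]), "smoker": r["smoker"]} for r in registry]

print(hospital_out[0]["id"] == registry_out[0]["id"])  # True: same person, same pseudonym
```

As the Netflix example below shows, removing or hashing direct identifiers does not by itself rule out re-identification.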
8571

86-
Callout sections can highlight information.
87-
88-
They are sometimes used to emphasise particularly important points
89-
but are also used in some lessons to present "asides":
90-
content that is not central to the narrative of the lesson,
91-
e.g. by providing the answer to a commonly-asked question.
92-
93-
::::::::::::::::::::::::::::::::::::::::::::::::
72+
![Data anonymization](fig/data_anonymization.jpg){alt='Names are censored before the datasets
73+
are sent to a central place'}
9474

75+
While data anonymization and pseudonymization are often a good first step, there is no guarantee
76+
that the data will never be reidentified. A famous example of reidentification is the story of the
77+
[Netflix prize](https://en.wikipedia.org/wiki/Netflix_Prize). The Netflix prize was an open
78+
competition to build the best recommender system to predict user ratings for films based on previous
79+
ratings. The data was anonymized, but in 2007 two researchers from The University of Texas at Austin
80+
were able to identify a large number of users by matching the dataset with film ratings on the
81+
Internet Movie Database (IMDB).
9582

96-
## Math
83+
## Federated data analysis
9784

98-
One of our episodes contains $\LaTeX$ equations when describing how to create
99-
dynamic reports with {knitr}, so we now use mathjax to describe this:
85+
There are different ways in which privacy risks can be mitigated. We will focus on the idea of
86+
federated analysis. In a federated setting, the data stays with the data owner, who keeps full control
87+
over it. In this case, it is not the data that travels, but the analysis itself. The system sends
88+
a query or instruction to the data, and only the results are sent back to the user.
89+
The results are often akin to a form of *aggregation* of the data. This can be in the shape of
90+
traditional
91+
statistics like the mean, or it can be more intricate like a machine learning model.
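As a minimal sketch of how a simple analysis can be decomposed in a federated way (the site data and function names below are made up, and this is plain Python rather than any particular federated framework): each data owner returns only a partial aggregate, and the central party combines the partials.

```python
# Runs at each data owner: only aggregates leave the site, never the records.
def local_partial_mean(ages):
    return {"sum": sum(ages), "count": len(ages)}

# Runs at the central party, which only ever sees the partial results.
def combine_partial_means(partials):
    total = sum(p["sum"] for p in partials)
    n = sum(p["count"] for p in partials)
    return total / n

site_a = [34, 51, 47]       # stays at hospital A
site_b = [62, 58, 45, 70]   # stays at hospital B

partials = [local_partial_mean(site_a), local_partial_mean(site_b)]
print(combine_partial_means(partials))  # identical to the mean over the pooled data
```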
10092

101-
`$\alpha = \dfrac{1}{(1 - \beta)^2}$` becomes: $\alpha = \dfrac{1}{(1 - \beta)^2}$
93+
Aggregating the data does not ensure complete protection of person-level information, but it
94+
certainly makes such leakage less likely.
10295

103-
Cool, right?
96+
TODO: Example of data leakage in simple aggregated case
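Purely as an illustration of how even simple aggregates can leak person-level information (the numbers below are made up): when two published totals differ by exactly one person, that person's value falls out directly.

```python
# Two published aggregates over almost the same group of employees.
total_salary_all_9_employees = 418_000
total_salary_without_the_cfo = 318_000

# Anyone who sees both "anonymous" totals learns the CFO's exact salary.
cfo_salary = total_salary_all_9_employees - total_salary_without_the_cfo
print(cfo_salary)  # 100000
```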
10497

105-
::::::::::::::::::::::::::::::::::::: keypoints
98+
## Federated learning
10699

107-
- Use `.md` files for episodes when you want static content
108-
- Use `.Rmd` files for episodes when you need to generate output
109-
- Run `sandpaper::check_lesson()` to identify any issues with your lesson
110-
- Run `sandpaper::build_lesson()` to preview your lesson locally
100+
The term federated learning was introduced in 2016 by researchers at Google
101+
[(McMahan et al.)](https://doi.org/10.48550/arXiv.1602.05629) and refers to a "loose federation of
102+
participating devices (which we refer to as clients) which are coordinated by a central server." In
103+
traditional federated learning, the clients train machine learning models, and only the updates of
104+
the models are sent back to the central server. The central server combines the updates from all the
105+
individual clients into one final machine learning model.
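As a minimal sketch of this idea (a toy one-parameter linear model in plain Python, not the full algorithm from the paper and not any particular framework's API; the data and learning rate are made up): each client trains on its own data and only sends its updated weight to the server, which averages the weights into the new global model.

```python
# Toy model y ≈ w * x, trained with federated averaging.
def local_update(w, data, lr=0.01):
    """One step of gradient descent on the client's own data; only the new weight leaves the client."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

clients = [
    [(1.0, 2.1), (2.0, 3.9)],  # data that never leaves client A
    [(3.0, 6.2), (4.0, 7.8)],  # data that never leaves client B
]

w = 0.0  # global model weight held by the central server
for _ in range(50):
    client_weights = [local_update(w, data) for data in clients]  # local training
    w = sum(client_weights) / len(client_weights)                 # server averages the updates

print(round(w, 2))  # close to 2.0, without the server ever seeing the raw (x, y) pairs
```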
106+
107+
There are caveats to using this type of data analysis though. Although the data transmitted from the
108+
clients to the server are an aggregation of the raw data, researchers have found a way to use this
109+
data to reconstruct the original data. This vulnerability is called *gradient* leakage.
110+
111+
![An example of gradient leakage](fig/gradient_leakage.jpg)
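To see why gradients can leak data at all, consider a deliberately simplified case (made-up numbers, a single training record, and a plain linear model rather than the networks used in real attacks): the gradient with respect to the weights is just the private input scaled by the prediction error, so the server can recover the input exactly.

```python
# Client side: one private training record (x, t) for a linear model y = w·x + b.
x = [2.0, 5.0, 7.0]            # private input features
t = 1.0                        # private label
w, b = [0.1, -0.2, 0.3], 0.0   # current global model, also known to the server

error = sum(wi * xi for wi, xi in zip(w, x)) + b - t  # prediction error
grad_w = [error * xi for xi in x]                     # gradient w.r.t. the weights
grad_b = error                                        # gradient w.r.t. the bias

# Server side: the transmitted update still reveals the raw features exactly.
reconstructed_x = [round(gw / grad_b, 6) for gw in grad_w]
print(reconstructed_x)  # [2.0, 5.0, 7.0]
```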
112+
113+
## Secure Multiparty Computation
114+
115+
There are different solutions to prevent the reconstruction of raw data. One solution is to make
116+
sure that no party other than the data owner is actually able to see the intermediate data. One
117+
branch
118+
of techniques that can be used for this is Secure Multiparty Computation (MPC). With MPC,
119+
computations are performed collaboratively by multiple parties. Data is encrypted in such a way that
120+
other parties cannot see the original values, but values of multiple parties can still be combined
121+
(e.g. added or
122+
multiplied).
123+
A classic technique from the field of MPC is secret sharing. With this technique, data is encrypted,
124+
after which the encrypted pieces (shares) are sent to the other parties. No single party will be able to
125+
reconstruct the original value. Only when all parties work together can the original value be
126+
retrieved.
127+
128+
When multiple values are combined using secret sharing, the parties end up owning new
129+
puzzle pieces that, when put together, reveal the result of the computation.
130+
131+
::: callout
132+
133+
### Secret sharing, an example
134+
135+
Mees, Sara and Noor want to know how much they weigh in total.
136+
Mees weighs 43 kg, Sara weighs 39, Noor weighs 45.
137+
They create secret shares for their weights that they give to their peers.
138+
139+
| | Mees receives | Sara receives | Noor receives | Sum |
140+
|-----------------|---------------|---------------|---------------|-----|
141+
| Mees generates: | -11 | 50 | 4 | 43 |
142+
| Sara generates: | -12 | 17 | 34 | 39 |
143+
| Noor generates: | 19            | -38           | 64            | 45  |
144+
145+
They sum their shares:
146+
147+
| | |
148+
|------|-----|
149+
| Mees | -4 |
150+
| Sara | 29 |
151+
| Noor | 102 |
152+
153+
They add their sums together: -4 + 29 + 102 = 127
154+
In this way, they have aggregated their data without sharing their individual data with anyone else.
155+
:::
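A minimal sketch of the bookkeeping in the example above, in plain Python (real MPC frameworks work over finite fields and add many safeguards, so treat this only as an illustration):

```python
import random

def make_shares(secret, n_parties):
    """Split a secret into random-looking shares that sum back to the secret."""
    shares = [random.randint(-100, 100) for _ in range(n_parties - 1)]
    shares.append(secret - sum(shares))  # the last share makes the total come out right
    return shares

weights = {"Mees": 43, "Sara": 39, "Noor": 45}

# Each person splits their weight into three shares and hands one share to each peer.
shares = {name: make_shares(weight, 3) for name, weight in weights.items()}

# Party i only ever sees the i-th share from every person and publishes the sum of those shares.
partial_sums = [sum(shares[name][i] for name in weights) for i in range(3)]

print(sum(partial_sums))  # 127: the total weight, while nobody saw anyone else's weight
```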
156+
TODO: Exercise with secret sharing where data is leaked.
157+
158+
## Differential privacy
159+
160+
As mentioned before, aggregation of data will not always prevent leaks of sensitive information.
161+
Consider the example of Mees, Sara and Noor. We know their total weight is 127 kg. If Sara and Noor
162+
get together and subtract their weights from the total, they will be able to infer how much Mees
163+
weighs.
164+
165+
An aggregation is differentially private when someone cannot infer whether a particular individual
166+
was used in the computation. A way to make a result more differentially private is to replace a
167+
selection of inputs with random noise. A single individual will then always be able to deny that
168+
their data has contributed to the final result. An individual has *plausible deniability* with
169+
regard to whether their data was
170+
part of the dataset.
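One classic way to obtain this kind of plausible deniability is *randomized response*, sketched below for a sensitive yes/no question (the question, probabilities, and sample size are made up; real deployments choose the noise level to match a formal privacy budget):

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Report the truth half of the time, otherwise report a coin flip."""
    if random.random() < 0.5:
        return true_answer
    return random.random() < 0.5

# 30% of people truthfully answer "yes" to the sensitive question.
true_answers = [random.random() < 0.3 for _ in range(100_000)]
reported = [randomized_response(a) for a in true_answers]

# Every individual can deny their reported answer, yet the overall rate is still recoverable:
observed = sum(reported) / len(reported)        # ≈ 0.5 * true_rate + 0.25
estimated_true_rate = (observed - 0.25) / 0.5
print(round(estimated_true_rate, 2))            # ≈ 0.3
```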
171+
172+
## Blocks upon blocks
173+
The previously mentioned techniques are not used in isolation, but are usually stacked on top of
174+
each other to mitigate the privacy risks that are relevant within a certain use case.
175+
Typically, the process begins by anonymizing or pseudonymizing the data. With vantage6, the data is
176+
then placed in a federated setting. You can use the existing algorithms available for vantage6,
177+
which often incorporate various privacy-enhancing techniques.
178+
179+
## Data partitioning
180+
Data sharing challenges come in many different shapes and sizes, but in the end, the goal of the
181+
researchers is often to analyze data *as if* it were available in one big table in one place.
182+
There are two main ways in which the dataset can be separated over different sources: **horizontal**
183+
and **vertical** partitioning. In horizontal partitioning, this giant table has been snipped into pieces
184+
by making horizontal cuts. The result is that all information about an individual record stays in one
185+
place, but the records themselves are scattered across different locations.
186+
187+
In vertical partitioning, the cuts have been made vertically. Columns have now been divided over
188+
different locations. This type of partitioning is usually more challenging because often a way needs
189+
to be found to link identities across data sources. Vertical partitioning requires different types
190+
of privacy enhancing algorithms than horizontal partitioning.
191+
192+
In reality, data can be horizontally and vertically partitioned at the same time. It might be
193+
necessary to combine multiple techniques to overcome these challenges.
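A minimal sketch of the two pure cases, using a small made-up table in plain Python (the column names and values are hypothetical):

```python
# The "one big table" researchers would like to analyze.
full_table = [
    {"id": 1, "age": 34, "smoker": True,  "income": 41_000},
    {"id": 2, "age": 51, "smoker": False, "income": 58_000},
    {"id": 3, "age": 47, "smoker": False, "income": 52_000},
]

# Horizontal partitioning: each site holds complete records, but for different people.
hospital_a = [row for row in full_table if row["id"] in (1, 2)]
hospital_b = [row for row in full_table if row["id"] == 3]

# Vertical partitioning: each site holds all people, but different columns,
# so the sites must first agree on how to link the record ids.
hospital   = [{"id": r["id"], "age": r["age"], "smoker": r["smoker"]} for r in full_table]
tax_office = [{"id": r["id"], "income": r["income"]} for r in full_table]
```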
194+
195+
## Technology doesn't solve everything
196+
197+
You have now learned about various technologies for analyzing data while preserving privacy of
198+
individuals. However, it should be emphasized that these technologies do not solve all your data
199+
sharing problems. Rather, they are only a small piece of the puzzle. In research projects involving
200+
privacy enhancing technologies, a lot of work goes into complying with regulations and building
201+
trust.
202+
203+
Since these projects have a risk of affecting the privacy of individuals, a Data Protection Impact
204+
Assessment (DPIA)
205+
is usually required. This is a process that helps identify and minimize the privacy risks of a
206+
project
207+
and is required by GDPR.
208+
209+
Apart from procedures required by GDPR there might be other regulations in place enforced by the
210+
owners of the data (e.g. hospitals). The specific situation of a project can affect the way in which
211+
the data is allowed to be processed. Some privacy enhancing technologies might be allowed in one
212+
project but prohibited in another. It is always important to stay transparent about privacy risks
213+
of the technologies you intend to use.
214+
215+
::::::::::::::::::::::::::::::::::::: keypoints
216+
217+
- TODO
111218

112219
::::::::::::::::::::::::::::::::::::::::::::::::
113220

md5sum.txt

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
55
"index.md" "a02c9c785ed98ddd84fe3d34ddb12fcd" "site/built/index.md" "2022-04-22"
66
"links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2022-04-22"
77
"notes.md" "7f1c9fbd8d8ae2649784ae52350da772" "site/built/notes.md" "2024-03-27"
8-
"episodes/introduction.md" "6c55d31b41d322729fb3276f8d4371fc" "site/built/introduction.md" "2023-07-24"
8+
"episodes/introduction.md" "08c3b2fcca2feb44b59a2679a796239c" "site/built/introduction.md" "2024-06-04"
99
"instructors/instructor-notes.md" "cae72b6712578d74a49fea7513099f8c" "site/built/instructor-notes.md" "2023-03-16"
1010
"learners/reference.md" "1c7cc4e229304d9806a13f69ca1b8ba4" "site/built/reference.md" "2023-03-16"
1111
"learners/setup.md" "5456593e4a75491955ac4a252c05fbc9" "site/built/setup.md" "2024-01-26"
