PhD research

Simulate data

Simulate identifiers (name, family name, country, sex, birth year)

Get names, family names and gender (treated here as sex at birth) for 32 European countries from these datasets (Facebook Data Leak, 2019)
Map each country to its two-letter code using this dict
Generate age distributions based on: the World Bank (with help from ChatGPT), the World Health Organization (with help from ChatGPT), the World Factbook
Use realistic population sizes from: United Nations Department of Economic and Social Affairs, Eurostat and the World Bank (with help from ChatGPT)
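
The steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: the names, population weights, age brackets and reference year are made-up placeholders standing in for the real datasets, and the function name `simulate_identifiers` is hypothetical.

```python
import random

# Hypothetical toy inputs. The real name lists, country codes, age
# distributions and population sizes come from the sources listed above.
NAMES = {"FR": [("Marie", "F"), ("Louis", "M")], "DE": [("Anna", "F"), ("Paul", "M")]}
FAMILY_NAMES = {"FR": ["Martin", "Bernard"], "DE": ["Schmidt", "Fischer"]}
POPULATION_WEIGHT = {"FR": 0.45, "DE": 0.55}  # made-up relative population sizes
AGE_SHARES = {  # made-up shares per (min_age, max_age) bracket
    "FR": {(0, 14): 0.2, (15, 64): 0.6, (65, 99): 0.2},
    "DE": {(0, 14): 0.1, (15, 64): 0.7, (65, 99): 0.2},
}
REFERENCE_YEAR = 2019  # assumed year used to convert an age into a birth year


def simulate_identifiers(n, seed=0):
    """Draw n records with name, family name, country, sex and birth year."""
    rng = random.Random(seed)
    countries = list(POPULATION_WEIGHT)
    weights = [POPULATION_WEIGHT[c] for c in countries]
    records = []
    for _ in range(n):
        # Countries are drawn proportionally to their population weight.
        country = rng.choices(countries, weights=weights)[0]
        # Sex is taken from the gender attached to the sampled name.
        name, sex = rng.choice(NAMES[country])
        family_name = rng.choice(FAMILY_NAMES[country])
        # Pick an age bracket from the country's age distribution,
        # then an age uniformly inside that bracket.
        brackets = list(AGE_SHARES[country])
        shares = [AGE_SHARES[country][b] for b in brackets]
        low, high = rng.choices(brackets, weights=shares)[0]
        age = rng.randint(low, high)
        records.append({
            "name": name,
            "family_name": family_name,
            "country": country,
            "sex": sex,
            "birth_year": REFERENCE_YEAR - age,
        })
    return records


if __name__ == "__main__":
    for record in simulate_identifiers(3):
        print(record)
```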

Python only, see:

.
└── simulate data
    └── python
        └── generate identifiers.ipynb

Generate covariates, treatment and outcome

Generate covariates (age + 4 Gaussian variables with different parameters)
Generate treatment $T \sim \mathrm{Bernoulli}(p = 0.4)$
Generate outcome (treatment effect either fixed or non-fixed across individuals)
- fixed: $Y = -10 + a T + b \exp(X_{4}) + c X_{3} X_{1} + d X_{5}$
- non-fixed: $Y = -10 + a T X_{2} + b \exp(X_{4}) + c X_{3} X_{1} + d X_{5}$
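
A minimal sketch of this data-generating process, using only the standard library. Several details are assumptions made for illustration: age is taken to be $X_{1}$, the Gaussian parameters, the age range and the coefficient values `a`, `b`, `c`, `d` are placeholders, and the function name `simulate_association` is hypothetical.

```python
import math
import random


def simulate_association(n, a=5.0, b=1.0, c=0.1, d=2.0, fixed_effect=True, seed=0):
    """Simulate covariates X1..X5, treatment T and outcome Y for n individuals."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.randint(18, 90)   # age, assumed here to play the role of X_1
        x2 = rng.gauss(1.0, 0.5)   # the four Gaussian covariates use
        x3 = rng.gauss(0.0, 1.0)   # placeholder means and standard deviations
        x4 = rng.gauss(0.0, 0.5)
        x5 = rng.gauss(2.0, 1.0)
        t = 1 if rng.random() < 0.4 else 0  # T ~ Bernoulli(p = 0.4)
        # Fixed effect: a * T. Non-fixed effect: a * T * X_2.
        effect = a * t if fixed_effect else a * t * x2
        y = -10 + effect + b * math.exp(x4) + c * x3 * x1 + d * x5
        rows.append({"x1": x1, "x2": x2, "x3": x3, "x4": x4, "x5": x5, "t": t, "y": y})
    return rows


if __name__ == "__main__":
    data = simulate_association(1000, fixed_effect=False)
    print(sum(row["t"] for row in data) / len(data))  # share of treated units
```

In the fixed design every treated individual gains exactly `a`; in the non-fixed design the gain is `a * x2`, so it differs from one individual to the next.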

In Python and R, see:

.
└── simulate data
    ├── python
    |   └── generate association.ipynb
    └── R
        └── generate association.R

Replicate Estimate-Tethered Stopping Rule algorithm

A Minimum Estimated Variance (MEV) algorithm developed in the paper "Simultaneous record linkage and causal inference with propensity score subclassification".
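
The selection idea can be sketched as below, following the way the Results section describes it: linked-record sets are added in decreasing order of link confidence, the treatment effect and its variance are re-estimated each time, and the set with the smallest estimated variance is kept. This is a simplified illustration rather than the paper's method or the repository's code: a plain difference in means with its usual variance estimate stands in for propensity score subclassification, and the function names are hypothetical.

```python
from statistics import mean, variance


def diff_in_means(records):
    """Return (ATE estimate, estimated variance) from (treatment, outcome) pairs.

    Stand-in estimator: the paper uses propensity score subclassification.
    Needs at least two treated and two control records.
    """
    treated = [y for t, y in records if t == 1]
    control = [y for t, y in records if t == 0]
    ate = mean(treated) - mean(control)
    var = variance(treated) / len(treated) + variance(control) / len(control)
    return ate, var


def minimum_estimated_variance(link_sets):
    """Pick the cumulative set of links whose ATE estimate has the lowest variance.

    link_sets[k] holds the (treatment, outcome) pairs added at step k,
    ordered from the most to the least confident links.
    Returns (index of the chosen set, ATE estimate, estimated variance).
    """
    cumulative = []
    best = None
    for index, new_links in enumerate(link_sets):
        cumulative.extend(new_links)
        ate, var = diff_in_means(cumulative)
        if best is None or var < best[2]:
            best = (index, ate, var)
    return best


if __name__ == "__main__":
    confident = [(1, 5.0), (1, 6.0), (1, 5.5), (0, 0.0), (0, 1.0), (0, 0.5)]
    doubtful = [(1, -20.0), (0, 30.0)]  # toy false links with mismatched outcomes
    print(minimum_estimated_variance([confident, doubtful]))
```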

In Python and R, see:

.
└── replicate MEV
    ├── python
    |   └── launch MEV algo.ipynb
    └── R
        └── launch MEV algo.R

Results

The paper studies an additive treatment effect, i.e. a treatment effect that is the same for all individuals.

See the images of the simulation results (Python):

.
└── images
    ├── fixed treatment effect
    |   ├── ate
    |   └── variance
    └── non-fixed treatment effect
        ├── ate
        └── variance

The plots show how the estimated variance of the treatment effect and the estimated average treatment effect evolve across the sets of linked records. The sets (on the x-axis) are in decreasing order of link confidence: the first set (0) relies almost only on true links, and each following set adds more recorded links, of which fewer and fewer are true links.

Solid blue lines show, for each set of linked records, the average over 10 rounds (each round simulates data and applies the MEV algorithm); the surrounding shaded areas show the 95% confidence intervals over the 10 rounds. The solid orange line marks the treatment effect value we are trying to recover.
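
For reference, a mean line and its band of this kind can be computed as sketched below. This assumes a normal-approximation interval (mean ± 1.96 standard errors); the notebooks may compute the interval differently, for example by bootstrap, and the function name `mean_and_band` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_band(rounds, z=1.96):
    """Summarise estimates across rounds for each linked-record set.

    rounds[r][s] is the estimate obtained in round r on set s.
    Returns one (mean, lower, upper) tuple per set, where the band is
    mean +/- z * standard error (normal approximation, an assumption here).
    """
    summary = []
    for values in zip(*rounds):  # iterate over sets, gathering all rounds
        centre = mean(values)
        half_width = z * stdev(values) / sqrt(len(values))
        summary.append((centre, centre - half_width, centre + half_width))
    return summary


if __name__ == "__main__":
    toy_rounds = [[1.0, 2.0], [3.0, 2.0]]  # 2 rounds, 2 sets (toy values)
    for centre, lower, upper in mean_and_band(toy_rounds):
        print(centre, lower, upper)
```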

We first observe that the algorithm also works in a non-fixed treatment effect setting (where the treatment effect differs across individuals), although the paper was written for a fixed treatment effect. We then notice an elbow in the variance plots (for both the fixed and non-fixed designs) that points to the best average treatment effect estimate, which is obtained for an early set (relying almost only on true links).

robachowyk/PhD-research
