This repository is making an RO-Crate of the BY-COVID WP5 T5.2 baseline use case on "SARS-CoV-2 Vaccine(s) effectiveness in preventing SARS-CoV-2 infection".
The RO-Crate can be browsed at https://by-covid.github.io/BY-COVID_WP5_T5.2_baseline-use-case/ (HTML) and as ro-crate-metadata.json
Task 5.2 aims to demonstrate how mobilisation of real-world population, health and care data across national borders can provide answers to policy-relevant research questions. Eventually, it aims to prototype a workflow that is standard for population health research. Here, the research question is approached by identifying a causal effect that allows to evaluate a public health intervention. As such, a methodology for approaching causal inference when conducting federated research is proposed and demonstrated, guaranteeing different layers of interoperability (i.e., legal, organisational, and semantic interoperability).
The methodological framework comprises the following steps:
- Defining the research question and the exposure-outcome relationship
- Establishing a causal model using Directed Acyclic Graphs (DAGs)
- Translating the causal model into data requirements using a Common Data Model (CDM)
- Generating synthetic data and developing an interoperable analytical pipeline
- Mobilising individual-level data within each of the nodes and transforming the data to comply with the CDM
- Deploying the interoperable analytical pipeline within the secure processing environment (SPE) of each of the sites (Participant Nodes)
- Comparative analysis of the aggregated results
The current use case aims to answer the following research question: "How effective have the SARS-CoV-2 vaccination programmes been in preventing SARS-CoV-2 infections?"
For more information, please consult the study protocol.
The current repository contains the following pieces:
- The Study Protocol (vaccine_effectiveness_study_protocol)
- Study protocol (PDF)
- The Causal Model (vaccines_effectiveness_causal_model)
- Interactive report showcasing the structural causal model as a DAG (HTML)
- Script to produce the causal model as a DAG (QMD)
- The Common Data Model specification (vaccines_effectiveness_common_data_model_specification)
- Human-readable common data model specification (XLSX)
- Human-readable metadata file (schema.org compliant) created with dataspice as an interactive report (HTML)
- Machine-readable structured metadata file (schema.org compliant) created with dataspice (JSON)
- The Synthetic Data (vaccines_effectiveness_synthetic_dataset)
- Script to generate the synthetic dataset (IPYNB)
- Synthetic dataset for 650,000 registries (CSV)
- Machine-readable version of the Exploratory Data Analysis (EDA) of the synthetic data (JSON)
- Interactive report of the Exploratory Data Analysis (EDA) of the synthetic data (HTML)
- The Analytical Pipeline (vaccine_effectiveness_analytical_pipeline)
- Documentation of the methodology of the scripts in the analytical pipeline (PDF)
- Scripts of the analytical pipeline
- Upload the csv file and create a DuckDB database file (R)
- Exploration and Data Quality Assessment (DQA) of the original data mapped at each Data Hub (QMD)
- Validation (applying logic validation rules) of the original data mapped at each Data Hub (QMD)
- Imputation of missing values in the original data mapped at each Data Hub (QMD)
- Iterative matching of the exposed and unexposed (QMD)
- Descriptive analysis of the matched and unmatched study population (QMD)
- Survival analysis of the matched study population (QMD)
- Input for the analytical pipeline (CSV)
- Log file created when running the analytical pipeline (TXT)
- Ouput of the analytical pipeline 1
- Interactive report with dataset statistics, missing data profiles, potential alerts, and detailed per-variable information (HTML)
- Interactive report summarizing compliance with the logic validation rules (HTML)
- Interactive report summarizing used imputation methods and number of imputed values
- Interactive report to assess matching balance in the obtained study population (HTML)
- Interactive report including the results of survival analysis in the unmatched study population, a table with baseline characteristics of the matched study population and CONSORT diagram (HTML)
- Interactive report including the analytic results (HTML)
- Aggregated non-sensitive analytic results for the comparative analysis (XLSX)
The first set of digital objects support conceptually what is needed to develop the analytical pipeline and have been published together as the Common Data Model Specification.
A research team constructs a causal model responding to the proposed research question. A Quarto RMarkdown script (vaccine_effectiveness.QMD
) produces the structural causal model (DAG) in an interactive HTML report (vaccine_effectiveness_causal_model.html
).
A human translates the causal model into data requirements (no technical link) and constructs a CDM in a human-readable version (vaccine_effectiveness_data_model_specification.xlsx
). A metadata file compliant to schema.org is produced using dataspice in both human-readable (vaccine_effectiveness_synthetic_dataset_spice.html
) and machine-readable (dataspice.json
) format.
The data model specification is translated by a human (no technical link) into a Python script (Jupyter notebook) to generate the synthetic data (by-covid_wp5_baseline_generate_synthetic_data_v.1.1.0.ipynb
). The script outputs (technical link) the synthetic data (vaccines_effectiveness_synthetic_dataset_pop_650k.csv
). An interactive report with the Exploratory Data Analysis (EDA) of the synthetic data is created using pandas-profiling (now ydata-profiling) in both human-readable (vaccine_effectiveness_synthetic_dataset_eda.html
) and machine-readable format (vaccine_effectiveness_synthetic_dataset_eda.json
).
The next set of digital objects are the consecutive scripts of the analytical pipeline. The individual scripts of the analytical pipeline are technically linked to each other. More information on the methodology can be found in the documentation. For illustrative purposes, the interactive reports as output of the analytical pipeline when applied to the synthetic dataset are provided.
Script: 0_global.R
- Input:
vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv
- Output:
cohort_data
A DuckDB database file is created (BY-COVID-WP5-BaselineUseCase-VE.duckdb
). Data are imported from a csv file (e.g. vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv
) using the R package Arrow
and inserted into the cohort_data
database table within the BY-COVID-WP5-BaselineUseCase-VE.duckdb
. Data types are manually specified according to the Common Data Model Specification when reading the data using a schema.
Script: 1_DQA.QMD
- Input:
cohort_data
- Output:
DQA.html
A data quality assessment on the cohort_data
is performed and an interactive html report (DQA.html
) is created. This report provides an overview of the data and includes dataset statistics, variable types, missing data profiles and potential alerts.
Script: 2_validation.QMD
- Input:
cohort_data
- Output:
cohort_data
includingflag_violation_val
validation.html
The cohort_data
are tested against a set of validation rules (as specified in the Common Data Model Specification) and the results of this validation process are summarised in an interactive html report (validation.html
). A logical variable flag_violation_val
is created in the cohort_data
table in the BY-COVID-WP5-BaselineUseCase-VE.duckdb
DuckDB database and set to TRUE
when at least one of the validation rules in the pre-specified set is violated (otherwise this variable is set to FALSE
).
Script: 3_imputation.QMD
- Input:
cohort_data
- Output:
cohort_data
includingflag_listwise_del
cohort_data_imputed
imputation_methods
imputation.html
For each variable in the cohort_data
different checks are conducted, based on which a decision is made on how to handle missing values. A logical variable flag_listwise_del
is created in the cohort_data
table in the BY-COVID-WP5-BaselineUseCase-VE.duckdb
DuckDB database and set to TRUE
for records for which the value of this variable is missing and the imputation_method=='Listwise deletion where core variable has missing values (MCAR reasonable)'
. Imputation of missing values of variables for which is was decided to impute was conducted using the R package mice
resulting in an imputed dataset. From this dataset, the records with imputed values are filtered and saved in a separate database table cohort_data_imputed
in the BY-COVID-WP5-BaselineUseCase-VE.duckdb
DuckDB database. Variables with a high degree of missingness are not included as a matching variable. A report (imputation.html
) is generated summarising the results of the different checks and methods used for dealing with missing values.
Script: 4_matching.QMD
(sourcing 4_matching.R
)
- Input:
cohort_data
includingflag_violation_val
andflag_listwise_del
cohort_data_imputed
imputation_methods
- Output:
group_similarity
result_matching_alg
matched_data
matching.html
In the script 4_matching.R, variables needed for the matching are created based on existing variables in the cohort_data
and cohort_data_imputed
. Records from individuals with a previous infection (previous_infection_bl==TRUE
), records violating one of the 'essential' validation rules (flag_violation_val==TRUE
) and records set to be listwise deleted (flag_listwise_del==TRUE
) are excluded. The matching is conducted using the R package MatchIt
. A new table group_similarity
is created in the BY-COVID-WP5-BaselineUseCase-VE.duckdb
DuckDB database containing for each group_id
the 10 nearest matched groups and corresponding distances. The matching algorithm iterates over the set of unique days during the enrollment period at which a newly vaccinated individual (i.e. completing a primary vaccination schedule) is identified. The results obtained for each date are appended to a database table result_matching_alg
in the BY-COVID-WP5-BaselineUseCase-VE.duckdb
DuckDB database in which one record corresponds to one matched pair. A new table matched_data
is subsequently created in the BY-COVID-WP5-BaselineUseCase-VE.duckdb
DuckDB database, with two records per match (i.e., one for the case and one for the control). After matching (termination 4_matching.R), the covariate balance is assessed and summarised in an interactive report (matching.html
).
Script: 5_descriptives.QMD
- Input:
cohort_data
includingflag_violation_val
andflag_listwise_del
cohort_data_imputed
matched_data
imputation_methods
- Output:
descriptive.html
The descriptive analysis contains four elements which are reported in descriptive.html
: a description of the considered time periods (data extraction period, enrollment period and study period), the results of a survival analysis in the unmatched population (adjusted and unadjusted), a flowchart describing the study population selection (CONSORT diagram) and a table with the baseline characteristics of the matched study population by intervention group.
Script: 6_survival-analysis.QMD
- Input:
matched_data
- Output:
survival-analysis.html
results-survival-analysis-<country>.xlsx
A survival analysis is conducted in the matched study population matched_data
. A hazard ratio (HR), the Restricted Mean Survival Time (RMST) and Restricted Mean Time Lost (RMTL) are reported in survival-analysis.html
. Aggregated non-sensitive results for meta-analysis are written to results-survival-analysis-<country>.xlsx
.
This analytical pipeline has been developed and tested in R (version 4.2.1) using RStudio desktop as a IDE (version 2022.07.1). To execute the analytical pipeline using the synthetic data or your own input data compliant with the Common Data Model specification, the dependencies and required installation steps are described below.
For testing purposes we assume a similar environment to the development environment.
The development environment included several R packages (from base R or CRAN) and the use of the R project file (.Rproj) included with the scripts. The required R packages and the version used for developing and testing the analytical pipeline:
- parallel (base package)
- stats (base package)
- graphics (base package)
- grDevices (base package)
- utils (base package)
- datasets (base package)
- dplyr (dplyr_1.1.2)
- arrow (arrow_10.0.1)
- validate (validate_1.1.3)
- DataExplorer (DataExplorer_0.8.2)
- DT (DT_0.27)
- purrr (purrr_1.0.1)
- dlookr (dlookr_0.6.1)
- ggpubr (ggpubr_0.5.0)
- survminer (survminer_0.4.9)
- quarto (quarto_1.2)
- ggplot2 (ggplot2_3.4.0)
- plotly (plotly_4.10.2)
- scales (scales_1.2.1)
- formattable (formattable_0.2.1)
- naniar (naniar_0.6.1)
- duckdb (duckdb_0.6.2)
- DBI (DBI_1.1.3)
- here (here_1.0.1)
- visdat (visdat_0.6.0)
- mice (mice_3.15.0)
- tidyr (tidyr_1.3.0)
- shiny (shiny_1.7.4)
- consort (consort_1.2.0)
- MatchIt (MatchIt_4.5.0)
- survival (survival_3.5-0)
- table1 (table1_1.4.3)
- knitr (knitr_1.41)
- tab (tab_5.1.1)
- forestmodel (forestmodel_0.6.2)
- gtsummary (gtsummary_1.7.2)
- survRM2 (survRM2_1.0-4)
- log4r (log4r_0.4.3)
- xlsx (xlsx_0.6.5)
- finalfit (finalfit_1.0.6)
- dbplyr (dbplyr_2.3.3)
To run the analytical pipeline with the required dependencies, different methods can be adopted: (1) installing R packages manually, (2) using the renv reproducible environment, (3) running the docker image, or (4) using Conda/Mamba.
Download the ZIP file of the repository using the following link: https://github.com/MarjanMeurisse/BY-COVID_WP5_T5.2_baseline-use-case/archive/refs/heads/main.zip
Extract all from the ZIP file and open the R project file contained within the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline
in RStudio.
Install the required R packages. To install a specific version of an R package from source, the following R command can be used (example for the R package dplyr, version 1.1.2):
packageurl <- "https://cran.r-project.org/src/contrib/Archive/dplyr/dplyr_1.1.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")
Input data, compliant with the Common Data Model specification should be provided as the only file within the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/input
. You can test the analytical pipeline with the synthetic data which are already provided within this folder (vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv
) or replace this file with your own real-world data.
The file BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/scripts/analytical-pipeline.QMD
can be opened in RStudio and rendered to run the sequential steps in the analytical pipeline. Output files of the analytical pipeline (interactive html reports, xlsx file with aggregated output) are generated withing the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/output
.
Download the ZIP file of the repository using the following link: https://github.com/MarjanMeurisse/BY-COVID_WP5_T5.2_baseline-use-case/archive/refs/heads/main.zip
Extract all from the ZIP file and open the R project file contained within the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline
in RStudio.
The R project uses renv. The following R command to check which packages are recorded in the lockfile but which are not installed:
renv::status()
Reproduce the testing environment by running the following R command:
renv::restore()
The metadata from the lockfile are used to install exactly the same version of every package.
Input data, compliant with the Common Data Model specification should be provided as the only file within the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/input
. You can test the analytical pipeline with the synthetic data which are already provided within this folder (vaccine_effectiveness_synthetic_pop_10k_v.1.1.1.csv
) or replace this file with your own real-world data.
The file BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/scripts/analytical-pipeline.QMD
can be opened in RStudio and rendered to run the sequential steps in the analytical pipeline. Output files of the analytical pipeline (interactive html reports, xlsx file with aggregated output) are generated withing the folder BY-COVID_WP5_T5.2_baseline-use-case-main/vaccine_effectiveness_analytical_pipeline/output
.
It is possible to run the analytical pipeline as an isolated application using Docker.
You may skip the docker build
command to download the latest container from GitHub.
cd vaccine_effectiveness_analytical_pipeline
# Enable below if you have modified scripts dependencies
#docker build -t ghcr.io/by-covid/vaccine_effectiveness_analytical_pipeline .
docker run -v `pwd`/input:/pipeline/input -v `pwd`/output:/pipeline/output -it ghcr.io/by-covid/vaccine_effectiveness_analytical_pipeline
Note that when using Docker in this way, file permission on your output
folder may not match up with the container's permissions when writing outputs. (tip: chmod -R 777 output
)
Instead of using containers it can be more convenient during development to use a Conda environment. The below assumes Miniconda have been installed and activated. To install the R packages listed in environment.yml
, use:
cd vaccine_effectiveness_analytical_pipeline
conda env crate
The above installs most of the R packages from Conda-Forge, avoiding a compilation phase. To install the remaining R packages from CRAN:
conda activate vaccine_effectiveness
Rscript install.R
Finally, to execute the main pipeline using Quarto:
conda activate vaccine_effectiveness
cd scripts
quarto render analytical-pipeline.QMD --execute --output-dir ../output/
This should populate output/
content as a series of HTML files.
Note: The environment.yml
is also used by the Dockerfile
to install its dependencies, and may have R packages in newer version than listed above
An Excel file with multiple sheets, each providing different outputs from the survival analysis, is generated by locally running the analysis pipeline (results-survival-analysis-<country>.xlsx
). This Excel file contains information that can be used to compare vaccine effectiveness across sites (see Zenodo publication for local outputs and comparative analysis of three sites).
The excel file contains the following sheets:
- Surv_summary: The survival function for the exposed (stratum 'fully_vaccinated_bl=TRUE') and unexposed group (stratum 'fully_vaccinated_bl=FALSE') is estimated using the Kaplan-Meier estimator. In this sheet, the number at risk, number of events, number censored, and survival probability (probability of not obtaining a SARS-CoV-2 infection beyond a certain time after onset of follow-up, plus confidence interval) for each time point and for each exposure group are presented. (Object 'surv_summary(surv_fit)')
- HR: A Cox regression model (which requires the Proportional Hazards assumption) is built to quantify the effectiveness of completing a primary vaccination schedule in preventing SARS-CoV-2 infection. A hazard ratio (HR) is computed and reported in this sheet. The HR can be interpreted as the instantaneous rate of SARS-CoV-2 infections in individuals that are at risk for obtaining an infection. (Object 'df_sum_cph')
- RMST_RMTL: Restricted Mean Survival Time (RMST, plus confidence interval) and Restricted Mean Time Lost (RMTL, plus confidence interval) over 365 days are computed for each exposure group (Exposed: arm 'fully_vaccinated_bl=TRUE'; Unexposed: arm 'fully_vaccinated_bl=FALSE') and reported in this sheet. The RMST over 365 days can be interpreted as the mean time to the occurrence of a SARS-CoV-2 infection (outcome) during a time period ranging from time 0 to day 365. The RMST and RMTL are suggested as an alternative measure in survival analyses when the assumption of Proportional Hazards does not hold. (Object 'df_rmst_result')
- ATE: An estimate of the average treatment effect (ATE) of interest, calculated in alternative ways (column 'Measure'). (Object 'df_rmst_contrast')
The survival probabilities, HR, RMST/RMTL, and ATE are estimated within subgroups determined by the vaccination schedule received by the case in the matched population (Vaccination_schedule):
- Surv_summary_vacschedule (Object 'surv_summary(surv_fit_vacschedule)')
- HR_vacschedule (Object 'df_sum_cph_subg_vacschedule')
- RMST_RMTL_vacschedule (Object 'df_rmst_result_vacschedule')
- ATE_vacschedule (Object 'df_rmst_contrast_vacschedule')
The survival probabilities, HR, RMST/RMTL, and ATE are estimated within subgroups determined by the NUTS3 residence area of individuals (Residence_area):
- Surv_summary_area (Object 'surv_summary(surv_fit_area)')
- HR_area (Object 'df_sum_cph_subg_area')
- RMST_RMST_area (Object 'df_rmst_result_area')
- ATE_area (Object 'df_rmst_contrast_area')
-
v.1.0.0: Initial iteration of the analytical pipeline scripts
-
v.1.0.1: Minor adjustments
-
v.1.0.2:
-
A log file (logfile.txt) is created in the ./logs folder (system settings, timing, errors)
-
Sex and age group are no longer handled as continuous variables
-
Implement handling large proportions of missing data in core variables
-
Add aggregated output for meta-analysis
-
Matching based on individual-level SES when available
-
Adjusted hover in plotly graphs
-
Additional analysis implemented: survival in subgroups determined by vaccination schedule
-
Adjusted documentation
-
- Meurisse, Marjan
- Estupiñán-Romero, Francisco
- Van Goethem, Nina
- González-Galindo, Javier
- Royo-Sierra, Santiago
- Bernal-Delgado, Enrique
By-COVID (Beyond COVID) is a Horizon Europe funded project (101046203).
Marjan Meurisse - marjan.meurisse@sciensano.be{.email}
Please, note that we provide these scripts as they are, complying with the specifications of BY-COVID WP5 baseline use case for the purposes and objectives specified within the baseline use case protocol. Software is provided as-is without further support out of the scope of the partners participating in BY-COVID WP5.
Footnotes
-
For illustrative purposes, the interactive reports contain the output of the scripts of the analytical pipeline applied to the synthetic dataset. ↩