
Harmonize all datafiles

Ming Wai Yeung edited this page May 31, 2022 · 2 revisions

Process definition table

read_defnition_table() expands parent codes using the code maps and sorts the resulting codes into those relevant for inclusion and those relevant for exclusion. Code maps should include all available codes. The function also cross-checks the codes entered in the definition table against the code maps and warns users of any non-matching codes (e.g. a specific ICD10 code may be absent from the UK Biobank ICD10 code map because it does not occur in the data, or because it was mistyped).

library(data.table)
library(ukbpheno)

# the directory with datafiles
pheno_dir <-"mydata/ukb12345/"

extdata_dir <- paste0(system.file("extdata", package="ukbpheno"),"/")

fdata_setting <- paste0(extdata_dir,"data.settings.tsv")
dfData.settings <- fread(fdata_setting)

fdefinitions <- paste0(extdata_dir,"definitions_cardiometabolic_traits.tsv")
dfDefinitions_processed_expanded<-read_defnition_table(fdefinitions,fdata_setting,extdata_dir)
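
To make the idea of parent-code expansion and the cross-check against the code map more concrete, here is a minimal sketch in base R. It is not the ukbpheno implementation: the helper expand_codes(), the toy code map and the simple prefix matching are all assumptions made for illustration only.

```r
# Toy code map: the ICD10 codes assumed to be available in the data
code_map <- c("I20", "I200", "I201", "I21", "I210", "I211", "I219", "I25", "I251")

# Codes as they might be entered in a definition: "I21" is a parent code,
# "I999" stands in for a typo or a code that is absent from the code map
definition_codes <- c("I21", "I251", "I999")

# Hypothetical helper: expand each code to itself plus all of its children,
# using prefix matching as a simplified stand-in for the code hierarchy
expand_codes <- function(codes, code_map) {
  expanded <- lapply(codes, function(code) code_map[startsWith(code_map, code)])
  unmatched <- codes[lengths(expanded) == 0]
  if (length(unmatched) > 0) {
    warning("Codes not found in code map: ", paste(unmatched, collapse = ", "),
            call. = FALSE)
  }
  sort(unique(unlist(expanded)))
}

expanded <- expand_codes(definition_codes, code_map)
print(expanded)
```

In this sketch "I21" picks up its child codes, "I251" matches only itself, and "I999" triggers a warning because nothing in the toy code map matches it.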

Harmonize all data from different sources

With a definition table (the example contains selected cardio-metabolic phenotypes) and all available data files at hand, we can proceed to parse the data files and harmonize them into a single event episode format. For individual-level data, including the self-report data, the cancer registry and optionally the death registry, the fields holding the diagnosis and the time of diagnosis are extracted from the main dataset (in their corresponding data types) and converted to episodes of clinical events. Touchscreen data is processed according to the conditions described in the definition table, if one is provided. The record-level data, downloaded from the data portal, is parsed and reorganized by data source and classification system.
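
The conversion from paired diagnosis and date fields to episodes can be pictured with the following base R sketch. The column names (code_0, date_0, ...) and the data are made up for illustration and do not correspond to actual UK Biobank field names; the package performs this step internally in harmonize_ukb_data().

```r
# Toy wide table: one row per participant, with paired code/date columns
wide <- data.frame(
  identifier = c(1001, 1002),
  code_0 = c("1075", "1065"),
  date_0 = as.Date(c("2006-03-23", "2010-01-15")),
  code_1 = c("1471", NA),
  date_1 = as.Date(c("2008-07-01", NA)),
  stringsAsFactors = FALSE
)

# Stack each code/date pair so that every row becomes one episode
slots <- c("0", "1")
episodes <- do.call(rbind, lapply(slots, function(s) {
  data.frame(
    identifier = wide$identifier,
    eventdate = wide[[paste0("date_", s)]],
    code = wide[[paste0("code_", s)]],
    stringsAsFactors = FALSE
  )
}))

# Drop empty slots and order by participant and date
episodes <- episodes[!is.na(episodes$code), ]
episodes <- episodes[order(episodes$identifier, episodes$eventdate), ]
rownames(episodes) <- NULL
print(episodes)
```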

The "allow_missing_fields" flag of harmonize_ukb_data() specifies whether fields that are required by the definition table but missing from the main dataset are tolerated and ignored. If this flag is set to FALSE, the harmonization step halts as soon as any required field is missing. If the participant withdrawal list is provided, the records of these individuals are removed.
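
As a rough illustration of these two behaviours, the sketch below mimics the flag and the withdrawal filter in base R. The helper check_required_fields(), the field IDs and the participant identifiers are hypothetical and only mirror the behaviour described above; they are not part of ukbpheno.

```r
# Hypothetical helper mirroring the described behaviour of the flag
check_required_fields <- function(required, available, allow_missing_fields = FALSE) {
  missing_fields <- setdiff(required, available)
  if (length(missing_fields) > 0 && !allow_missing_fields) {
    stop("Missing field(s): ", paste(missing_fields, collapse = ", "), call. = FALSE)
  }
  # Only fields present in the dataset can be used downstream
  intersect(required, available)
}

required <- c("20002", "20004", "40006")  # toy field IDs requested by a definition table
available <- c("20002", "40006")          # toy field IDs present in the main dataset

# Lenient mode: the missing field is ignored
usable <- check_required_fields(required, available, allow_missing_fields = TRUE)
print(usable)

# Strict mode: the same input halts with an error (captured here for display)
strict_result <- tryCatch(
  check_required_fields(required, available, allow_missing_fields = FALSE),
  error = function(e) conditionMessage(e)
)
print(strict_result)

# Removing withdrawn participants from a table of records
records <- data.frame(identifier = c(1001, 1002, 1003), code = c("I21", "I25", "I20"))
withdrawn <- c(1002)
records <- records[!records$identifier %in% withdrawn, ]
print(records)
```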

# main dataset 
fukbtab <- paste(pheno_dir,"ukb12345.tab",sep="")

# meta data file
fhtml <- paste(pheno_dir,"ukb12345.html",sep="")

# hospital inpatient data
fhesin <- paste(pheno_dir,"hesin.txt",sep="")
fhesin_diag <- paste(pheno_dir,"hesin_diag.txt",sep="")
fhesin_oper <- paste(pheno_dir,"hesin_oper.txt",sep="")

# GP data
fgp_clinical <- paste(pheno_dir,"gp_clinical.txt",sep="")
fgp_scripts <- paste(pheno_dir,"gp_scripts.txt",sep="")

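# Death registry data (file names assumed here; adjust them to your own download)
fdeath_portal <- paste(pheno_dir,"death.txt",sep="")
fdeath_cause_portal <- paste(pheno_dir,"death_cause.txt",sep="")

# Participant withdrawal list
f_withdrawal <- paste(pheno_dir,"w12345_20210809.csv",sep="")

# Parse all data files and harmonize them into the episode format
lst.harmonized.data <- harmonize_ukb_data(
  f.ukbtab = fukbtab,
  f.html = fhtml,
  dfDefinitions = dfDefinitions_processed_expanded,
  f.gp_clinical = fgp_clinical,
  f.gp_scripts = fgp_scripts,
  f.hesin = fhesin,
  f.hesin_diag = fhesin_diag,
  f.hesin_oper = fhesin_oper,
  f.death_portal = fdeath_portal,
  f.death_cause_portal = fdeath_cause_portal,
  f.withdrawal_list = f_withdrawal,
  allow_missing_fields = TRUE
)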

View(lst.harmonized.data$lst.data)

View(lst.harmonized.data$lst.data$tte.hesin.icd10.primary)

Format of the harmonized data

After harmonization, we get a list of clinical events organized by data source and classification system, all in the same format. A synthetic example for self-report data:

identifier eventdate code event
1234567 2006-03-22 1471 1
1234567 2006-03-23 1075 0
2469134 2006-03-23 1588 1
3570245 2011-04-07 1076 1
4827158 2003-05-06 1223 1
5517284 2015-03-29 1065 0

For hospital inpatient records, which also provide the duration of the stay, one additional column "epidur" is included. The event column indicates whether the episode carries a true event date:

  • Event date from linked data or self-report operation = 1
  • Self-report event date except for operations = 2
  • Not a true event date = 0 (diagnosis codes with unknown event date will have date of visit to assessment center in the “eventdate” column)
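
One way to use the event column downstream is to keep only the episodes with a true event date and then take the earliest one per participant. The sketch below does this in base R on the synthetic table shown above; it is an illustration of working with the format and does not use any ukbpheno function.

```r
# Synthetic episodes in the harmonized format (same values as the example table)
episodes <- data.frame(
  identifier = c(1234567, 1234567, 2469134, 3570245, 4827158, 5517284),
  eventdate = as.Date(c("2006-03-22", "2006-03-23", "2006-03-23",
                        "2011-04-07", "2003-05-06", "2015-03-29")),
  code = c("1471", "1075", "1588", "1076", "1223", "1065"),
  event = c(1L, 0L, 1L, 1L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Keep only episodes with a true event date (event is 1 or 2)
dated <- episodes[episodes$event != 0, ]

# Earliest dated episode per participant
dated <- dated[order(dated$identifier, dated$eventdate), ]
first_event <- dated[!duplicated(dated$identifier), ]
rownames(first_event) <- NULL
print(first_event)
```

Participants whose only episodes have event = 0 drop out of first_event, since their "eventdate" is the assessment center visit rather than the date of diagnosis.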