
Define clinical phenotypes

Ming Wai Yeung edited this page May 27, 2022 · 1 revision

Build definition table for the UK Biobank data

Health outcome information is encoded differently depending on the data source or data field that captures it. The relationships between these data fields and their encodings have been curated and recorded in the data settings file included in the ukbpheno package. For a target phenotype, we need to identify all relevant diagnosis and operation codes by surveying the various data sources and data fields on the Showcase.

For example, participants with the following codes are likely to suffer from coronary artery disease:

Variable: Coronary artery disease

    Coding system          Codes
    ICD-9                  414, 410, 412
    ICD-10                 I24, I25, Z955, I21, I22, I23
    OPCS-4                 K40, K41, K42, K43, K44, K45, K46, K49, K50, K75
    Self-reported fields   20002(1075), 20004(1070, 1095, 1523), 6150(1)
    READ2                  G34y1, G34.., G3..., ZV45L, G34z0, ZV458, 793G., 79280, 79281, 79282, 7928y, 7928z, 79292, 7929y, 7929z, 792.., 7A547, 793Gy, 793Gz, 79283
    CTV3                   G34y1, XE0WG, XE2uV, XaC1g, XaG1Q, XaQiY, ZV458, G34.., X200b, Xa1dP, XaLgU, 79280, 79281, 79282, 7928y, 7928z, 79292, 7929y, 7929z, X00tT, X013N, XE0Em, XaLgZ, XaLga, XaMKE

Enter these codes in a definition table (prefilled_template), which is read by the ukbpheno package.

Syntax of definition table

Fill in one phenotype (such as Cad) per row. The “TRAIT” column contains the unique identifier of the phenotype, which is case sensitive. Enter the codes in the column of the corresponding coding system. For example, the “ICD10” column for Cad:

figure3_h
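As a rough illustration of the one-phenotype-per-row layout, the following Python sketch builds a single row and checks that the TRAIT identifiers are unique. It is not part of ukbpheno. Only the column names TRAIT and ICD10 come from the text above; the tab-delimited layout, the comma separator between codes, and the helper names `duplicated_traits` and `to_tsv` are assumptions made for this example.

```python
import csv
import io

# TRAIT and ICD10 are column names taken from the text; the comma-separated
# code list and the tab-delimited layout are assumptions of this sketch.
rows = [
    {"TRAIT": "Cad", "ICD10": "I21,I22,I23,I24,I25,Z955"},
]


def duplicated_traits(rows):
    """Return TRAIT identifiers that occur more than once (case sensitive)."""
    seen, dupes = set(), []
    for row in rows:
        trait = row["TRAIT"]
        if trait in seen:
            dupes.append(trait)
        seen.add(trait)
    return dupes


def to_tsv(rows, columns=("TRAIT", "ICD10")):
    """Serialise the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(columns), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(to_tsv(rows))
```

Because the identifier is case sensitive, “Cad” and “CAD” count as two different phenotypes in this check, so it is worth settling on one spelling per phenotype.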

For the “TS” (touchscreen) column:

  1. Fill in the field number as listed on the Showcase, followed by the condition, e.g. “6177=3(insulin)”

  2. The field holding the corresponding age of diagnosis can be added in “[]” after the condition, e.g. “4041=1[2976](Gestational diabetes)”

  3. Condition symbols accepted:

    Condition symbol Meaning
    = Equal to (value)
    != Not equal to
    < Smaller than
    <= OR ≤ Smaller than or equal to
    > Larger than
    >= OR ≥ Larger than or equal to

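To make the “TS” syntax above concrete, here is a minimal Python sketch that splits one entry into its parts and evaluates the condition. It is an illustration only and not the parser used by ukbpheno. The function names `parse_ts_entry` and `satisfies`, the returned dictionary keys, and the assumption that condition values are numeric are all choices made for this example.

```python
import operator
import re

# field number, condition symbol, value, optional [age field], optional (label)
TS_ENTRY = re.compile(
    r"^\s*(?P<field>\d+)\s*"
    r"(?P<op>!=|<=|>=|≤|≥|=|<|>)\s*"
    r"(?P<value>[^\[\](]+?)\s*"
    r"(?:\[(?P<age_field>\d+)\])?\s*"
    r"(?:\((?P<label>.*)\))?\s*$"
)

OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def parse_ts_entry(entry):
    """Split a TS entry such as '4041=1[2976](Gestational diabetes)'."""
    match = TS_ENTRY.match(entry)
    if match is None:
        raise ValueError(f"not a valid TS entry: {entry!r}")
    # Normalise the Unicode variants of the condition symbols.
    op = match["op"].replace("≤", "<=").replace("≥", ">=")
    return {
        "field": int(match["field"]),
        "op": op,
        "value": float(match["value"]),  # assumption: values are numeric
        "age_field": int(match["age_field"]) if match["age_field"] else None,
        "label": match["label"],
    }


def satisfies(parsed, observed):
    """Check an observed field value against the parsed condition."""
    return OPS[parsed["op"]](observed, parsed["value"])


entry = parse_ts_entry("4041=1[2976](Gestational diabetes)")
print(entry["field"], entry["op"], entry["value"], entry["age_field"])
```

The two-character symbols are listed before the single-character ones in the pattern so that “<=” is not read as “<” followed by a value starting with “=”.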
UKB code explorer

The package includes a Shiny app for cross-referencing codes between coding systems. It uses the all_lkps_maps_v*.xlsx file provided by UK Biobank, which you supply yourself. All other required coding files are included in the package (/inst/extdata/).

Rscript shiny.lookup_codes.R --help
Rscript shiny.lookup_codes.R --fcoding_xls path_to/all_lkps_maps_v3.xlsx \
--f_med_readSR path_to/dfCodesheetREAD_SR.Coding.RData \
--fcoding_icd10 path_to/ICD10.coding19.tsv \
--fcoding_icd9 path_to/ICD9.coding87.tsv \
--fcoding_opcs4 path_to/OPCS4.coding240.tsv \
--fcoding_20003 path_to/20003.coding4.tsv

figure4-ukb_code_explorer

Composite phenotype

A composite phenotype is a phenotype that includes or excludes other phenotypes. For example, a composite phenotype “diabetes mellitus” may include the two phenotypes “type 1 diabetes” and “type 2 diabetes”. The following four columns are used to construct a composite phenotype:

    TRAIT   Exclude_from_cases   Study_population   Exclude_from_controls   Include_definitions
    Dm                                              DmRx                    DmT1, DmT2

“Study_population” can be used to restrict the definition to a subgroup of participants with a specific phenotype. Participants with any of the phenotypes in “Include_definitions” are considered cases for the composite phenotype. “Exclude_from_cases” and “Exclude_from_controls” exclude participants with certain phenotype(s) from cases and controls respectively.
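The way these four columns could combine can be sketched with plain Python sets. This is an illustration under stated assumptions and not the ukbpheno implementation: the function `composite_status` and the participant IDs are hypothetical. The sketch also assumes that multiple phenotypes in one column are combined by union, and that participants removed from cases are treated as neither case nor control.

```python
def composite_status(
    all_ids,
    phenotype_cases,
    include,
    exclude_from_cases=(),
    exclude_from_controls=(),
    study_population=(),
):
    """Return (cases, controls) as sets of participant IDs.

    phenotype_cases maps a TRAIT identifier to the set of participant IDs
    that are cases for that phenotype.
    """

    def union(names):
        ids = set()
        for name in names:
            ids |= phenotype_cases[name]
        return ids

    population = set(all_ids)
    if study_population:
        # Assumption: restrict to participants having a listed phenotype.
        population &= union(study_population)

    included = union(include)
    cases = (included - union(exclude_from_cases)) & population
    # Assumption: anyone in an included definition is never a control.
    controls = population - included - union(exclude_from_controls)
    return cases, controls


# Hypothetical participant IDs for the Dm row shown above.
phenotype_cases = {"DmT1": {1}, "DmT2": {2, 3}, "DmRx": {3, 4}}
cases, controls = composite_status(
    all_ids={1, 2, 3, 4, 5},
    phenotype_cases=phenotype_cases,
    include=["DmT1", "DmT2"],
    exclude_from_controls=["DmRx"],
)
print(sorted(cases), sorted(controls))
```

In this toy example, participant 4 has DmRx without DmT1 or DmT2, so they are dropped from the controls without becoming a case.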