layout: home
title: HDR UK Phenotype Library technical specifications

Technical documentation

Phenotype Library Inclusion Criteria

For an algorithm to be included in the Phenotype Library, it must satisfy the following criteria:

Define a disease (e.g. hypertension), lifestyle risk factor (e.g. smoking) or biomarker (e.g. blood pressure).
Derive information from one or more electronic health record (EHR) data sources, which can be national or local. For this purpose, EHR data include administrative data, such as billing/claims data, and clinical audits.
Have one or more peer-reviewed outputs associated with it, e.g. journal publications, scientific conference papers or policy white papers.
Provide evidence of how the phenotyping algorithm was validated.

Specification

Phenotyping algorithms are stored in the Phenotype Library using a combination of YAML, CSV and markdown files. Each algorithm has two main components: a) the phenotype definition file (written in YAML and markdown), and b) one or more terminology files (also known as codelists), which are stored as CSV files. The sections below describe their schema and contents.

[Figure: an Electronic Health Records phenotyping algorithm consists of a phenotype definition file (metadata and content) and one or more codelist files.]

File naming

All phenotype definition files associated with a phenotype use a common naming pattern:

AUTHORSURNAME_NAME_UUID.md

for example: axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj.md

Phenotype files are stored in the _phenotypes directory.

Codelist files follow a similar pattern, with the terminology appended:

AUTHORSURNAME_NAME_UUID_TERMINOLOGY.csv

for example: axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD10.csv

Codelist files are stored in the codelists directory.
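
The naming convention can be checked mechanically. The following is a minimal Python sketch, not part of the library itself: the regular expression and the helper names `parse_filename` and `codelist_filename` are hypothetical, and the sketch assumes a lowercase author surname and a 22-character alphanumeric identifier, as in the example file names above.

```python
import re

# Assumed shape: surname_name_uuid[_TERMINOLOGY].ext, with a 22-character
# alphanumeric identifier as in the example file names.
FILENAME_RE = re.compile(
    r"^(?P<author>[a-z]+)_"
    r"(?P<name>[a-z0-9_]+)_"
    r"(?P<uuid>[A-Za-z0-9]{22})"
    r"(?:_(?P<terminology>[A-Za-z0-9]+))?"
    r"\.(?P<ext>md|csv)$"
)


def parse_filename(filename):
    """Split a phenotype definition or codelist file name into its parts."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    parts = match.groupdict()
    # Definition files carry no terminology suffix; codelists must have one.
    if (parts["ext"] == "csv") != (parts["terminology"] is not None):
        raise ValueError(f"extension and terminology disagree: {filename!r}")
    return parts


def codelist_filename(definition_filename, terminology):
    """Derive a codelist file name from a phenotype definition file name."""
    parts = parse_filename(definition_filename)
    if parts["ext"] != "md":
        raise ValueError("expected a phenotype definition (.md) file name")
    return f"{parts['author']}_{parts['name']}_{parts['uuid']}_{terminology}.csv"


parts = parse_filename("axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD10.csv")
print(parts["author"], parts["name"], parts["uuid"], parts["terminology"])
```

A check like this could be run over the `_phenotypes` and codelist directories before a submission, to catch file names that do not line up with each other.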

Phenotype definition file

The phenotype definition file is a markdown file with a YAML header. The YAML header is used to record metadata fields capturing information about the algorithm, the data sources, controlled clinical terminologies and other information.

For example, the code snippet below displays the metadata associated with the bronchiectasis phenotyping algorithm submitted by the HDR UK BREATHE Hub (you can view the raw file directly on the repository).

layout: phenotype
title: Bronchiestasis
name: Bronchiestasis
phenotype_id: ZckoXfUWNXn8Jn7fdLQuxj 
type: Disease or Syndrome
group: Respiratory
data_sources: 
    - Clinical Practice Research Datalink GOLD
    - Clinical Practice Research Datalink Aurum
    - Hospital Episode Statistics APC for CPRD GOLD
    - Hospital Episode Statistics APC for CPRD Aurum
    - Death Registration data for CPRD GOLD
    - Death Registration data for CPRD Aurum
    - UK Biobank
clinical_terminologies: 
    - Read Version 2
    - SNOMED-CT
    - ICD-10
    - ICD-11
validation: 
    - prognostic
codelists:
    - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD10.csv
    - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD11.csv
    - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_SNOMEDCT.csv
    - axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_UKBIOBANK.csv
valid_event_data_range: 01/01/2001 - 31/12/2019
sex: 
    - Female
    - Male
author: 
    - Eleanor L Axson
    - Jennifer K Quint
publications: 
status: BETA
date: 2019-06-20
modified_date: 2019-06-20
version: 1
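
To work with a definition file programmatically, the YAML header first has to be separated from the markdown body. The sketch below is one way to do that using only the Python standard library. It assumes the header is enclosed between two `---` lines, as is conventional for Jekyll front matter; the excerpt above shows only the fields, so treat the delimiter as an assumption. The function name `split_front_matter` is hypothetical.

```python
def split_front_matter(text):
    """Return (header, body) for a markdown document with a YAML header."""
    lines = text.splitlines()
    # Assumption: the header is opened and closed by a line containing '---'.
    if not lines or lines[0].strip() != "---":
        raise ValueError("document does not start with a '---' line")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            header = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return header, body
    raise ValueError("YAML header is not closed by a '---' line")


# Shortened, illustrative document; the body text is made up for the example.
document = """---
layout: phenotype
title: Bronchiestasis
phenotype_id: ZckoXfUWNXn8Jn7fdLQuxj
---
Free-text description of the algorithm goes here.
"""

header, body = split_front_matter(document)
print(header)
print(body)
```

The `header` string can then be passed to a YAML parser, for example `yaml.safe_load` from PyYAML, to obtain the metadata as a dictionary.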

The metadata fields required are the following:

title (string): Phenotype (long) name
name (string): Phenotype (short) name
data_sources (list of strings): Names of the data sources from which the phenotype draws information. Where possible, these should be identical to the names used to identify individual datasets in the HDR Gateway.
clinical_terminologies (list of strings): List of controlled clinical terminologies that are used by the phenotype algorithm.
validation (list of strings): Types of validation evidence supporting the robustness of the phenotype. Valid values:
- prognostic: the ability to replicate known prognostic associations
- aetiologic: the ability to replicate known associations with risk factors
- genetic: the ability to replicate associations with known regions or variants
- cross-source: the algorithm has been evaluated in a similar external data source
- casenote review: the algorithm has been validated through manual review of clinical notes (this usually yields PPV and NPV values)
- cross-country: the algorithm has been evaluated in a similar external healthcare system
codelists (list of strings): (unordered) list of CSV terminology files associated with the phenotype
phenotype_id (string): Universally unique phenotype identifier, generated using the shortuuid Python module.
group (string): Disease group for phenotype
valid_event_data_range (string): Date range for valid events, written as DD/MM/YYYY - DD/MM/YYYY
sex (list of strings): list of sexes valid for the phenotype
author (list of strings): list of phenotype authors
publications (list of strings): list of publications
status (string): 'DRAFT' or 'FINAL' status
date (string): date created
modified_date (string): date last modified
version (integer): integer version of phenotype, default '1'
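
The field list above lends itself to an automated check. The following is a minimal sketch in Python that works on a header already parsed into a dictionary. The function names `check_metadata` and `parse_event_range` are hypothetical, the check covers only field presence, the `validation` values and the date range, and it assumes `valid_event_data_range` is a single string in the form shown in the example header.

```python
from datetime import datetime

# Field names and validation values are taken from the list above.
REQUIRED_FIELDS = [
    "title", "name", "data_sources", "clinical_terminologies", "validation",
    "codelists", "phenotype_id", "group", "valid_event_data_range", "sex",
    "author", "publications", "status", "date", "modified_date", "version",
]
VALIDATION_VALUES = {
    "prognostic", "aetiologic", "genetic",
    "cross-source", "casenote review", "cross-country",
}


def parse_event_range(value):
    """Parse 'DD/MM/YYYY - DD/MM/YYYY' into a (start, end) pair of dates."""
    start_text, end_text = (part.strip() for part in value.split(" - "))
    start = datetime.strptime(start_text, "%d/%m/%Y").date()
    end = datetime.strptime(end_text, "%d/%m/%Y").date()
    if start > end:
        raise ValueError("event range starts after it ends")
    return start, end


def check_metadata(metadata):
    """Return a list of problems found in an already parsed YAML header."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in metadata]
    for value in metadata.get("validation") or []:
        if value not in VALIDATION_VALUES:
            problems.append(f"unknown validation value: {value}")
    if "valid_event_data_range" in metadata:
        try:
            parse_event_range(metadata["valid_event_data_range"])
        except (ValueError, AttributeError) as error:
            problems.append(f"bad valid_event_data_range: {error}")
    return problems


# Deliberately incomplete header, used only to show the reported problems.
example = {
    "title": "Bronchiestasis",
    "validation": ["prognostic", "clinical"],
    "valid_event_data_range": "01/01/2001 - 31/12/2019",
}
for problem in check_metadata(example):
    print(problem)
```

Returning a list of problems, rather than raising on the first one, lets a submitter fix everything in a single pass.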

Terminology files (codelists)

Codelist files are specified as CSV files with one term per row - for example:

ICD-10 code,ICD-10 term
J47,Bronchiectasis
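
A codelist in this shape can be read with Python's standard `csv` module. The sketch below assumes the first row is a header and that the first two columns hold the code and the term, as in the ICD-10 example; codelists for other terminologies may use different columns. The function name `read_codelist` is hypothetical.

```python
import csv
import io


def read_codelist(handle):
    """Read a codelist CSV and return (header, list of (code, term) pairs)."""
    reader = csv.reader(handle)
    header = next(reader, None)  # e.g. ['ICD-10 code', 'ICD-10 term']
    if header is None or len(header) < 2:
        raise ValueError("expected a header with a code and a term column")
    rows = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        if len(row) < 2:
            raise ValueError(f"row has no term column: {row!r}")
        rows.append((row[0].strip(), row[1].strip()))
    return header, rows


# In-memory copy of the example above, so the sketch runs without a file.
sample = io.StringIO("ICD-10 code,ICD-10 term\nJ47,Bronchiectasis\n")
header, rows = read_codelist(sample)
print(header)
print(rows)
```

For a file on disk, the handle would typically come from `open(path, newline="", encoding="utf-8")`, which is how the `csv` documentation recommends opening files.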

How to submit data

You can download a sample template file from the repository.

If you have a phenotyping algorithm that meets the inclusion criteria above, we invite you to submit your data in one of the following ways:

by 📫 email s.denaxas at ucl.ac.uk
by 🐙 pull request (PR) on the GitHub repository
by ⬆ Google form
