
ScottishLabData - A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses

Background

In Scotland, laboratory data is captured, stored and reported from local health board systems with significant heterogeneity. For researchers and other users, working with laboratory datasets across regional cohorts therefore requires substantial time and effort to reconcile differences in structure, test coding and units.

Requirements

The over-arching requirement is to develop an open-source software pipeline to generate a harmonised laboratory dataset.

Functional

Some additional functional requirements are:

  • Merging the different data tables from each Safe Haven's database into a single common table structure.
  • Standardising censored results (see the sketch after this list).
  • Changing all local test codes to Read Codes.
  • Focusing, based on test frequency, on the top 100 most frequently recorded Read Codes in each Safe Haven (180 unique Read Codes in total).
  • Standardising test units, so that tests with the same Read Code are reported in the same unit.
  • Technically validating the harmonised and cleaned data through visualisation.
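
As an illustration of the censored-results requirement, the transformation amounts to splitting a text result such as "<0.5" or ">100" into a numeric value plus a censoring flag. The T-SQL sketch below shows the idea only; the table and column names (dbo.LabResults, ResultValue) are hypothetical placeholders, not the repository's actual schema.

```sql
-- Illustrative sketch only: split a censored text result such as '<0.5' or '>100'
-- into a numeric value plus a censoring flag. Names here are hypothetical placeholders.
SELECT
    ResultValue                                            AS RawResult,
    CASE
        WHEN LEFT(LTRIM(ResultValue), 1) IN ('<', '>')
            THEN LEFT(LTRIM(ResultValue), 1)
        ELSE NULL
    END                                                    AS CensorFlag,
    TRY_CAST(
        CASE
            WHEN LEFT(LTRIM(ResultValue), 1) IN ('<', '>')
                THEN SUBSTRING(LTRIM(ResultValue), 2, LEN(ResultValue))
            ELSE ResultValue
        END AS FLOAT)                                      AS NumericResult
FROM dbo.LabResults;  -- hypothetical source table
```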

Non-Functional

Two non-functional requirements were also identified:

  • Operating in an environment where compute resources may be limited.
  • Handling the large scale of the data, using tools and methods chosen for good performance.

Description

We completed the required data governance applications and extracted laboratory tests (biochemistry, haematology, immunology, virology, pathology and microbiology) from all registered SHARE participants who were resident in the region of each of the four regional Safe Havens from 01/01/2015 to 31/12/2021 (or until death).

The data was provided as flat files from the Safe Havens to HIC and loaded into a Microsoft Structured Query Language (SQL) Server database (version 16.0.1) for ease of handling the large data volumes. Using WhiteRabbit, sampling one million rows from each flat file, we tested the data types, completeness (fraction empty) and consistency (N unique and fraction unique) of each Safe Haven’s laboratory results. We used the Rabbit-In-a-Hat software to read and display the WhiteRabbit scan results; Rabbit-In-a-Hat allows a user to connect source data to tables and columns within the target data structure through a graphical user interface. Starting from the SQL skeleton auto-generated by Rabbit-In-a-Hat, the mapping was further edited in consultation with the local Safe Havens to establish the final SQL script for structural harmonisation.
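
The structural-harmonisation scripts follow the same broad pattern as the skeleton Rabbit-In-a-Hat generates: an INSERT ... SELECT that maps each source column onto the agreed target structure. The sketch below is illustrative only; dbo.DaSH is one of the source tables listed under "Running SQL scripts", but the target table and all column names are placeholder assumptions, not the repository's actual schema.

```sql
-- Illustrative mapping skeleton only: copy one Safe Haven's source table into a
-- common target structure. dbo.HarmonisedLab and the column names are hypothetical.
INSERT INTO dbo.HarmonisedLab (PatientId, SampleDate, TestCode, ResultValue, ResultUnit)
SELECT
    src.PatientId,       -- source identifiers mapped onto the target columns
    src.SampleDate,
    src.LocalTestCode,   -- later changed to the corresponding Read Code
    src.Result,
    src.Units
FROM dbo.DaSH AS src;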

Data exploration and harmonisation were conducted using a mixture of SQL and R (version 4.1.3).

This repository

Repository sub-directories:

  • SQL code for table structure/ SQL scripts for harmonising the structure of the data, standardising the censored results and changing all test codes to Read Codes.

  • R code for data harmonisation/ R scripts for selecting the 180 Read Codes, standardising test units and producing validation visualisations (an illustrative frequency query follows this list).

  • Paper Appendix/ Appendix for the published paper related to this work.
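
The selection of the 180 Read Codes is performed by the R scripts; purely to illustrate the idea, an equivalent frequency query in T-SQL (against a hypothetical harmonised table, with placeholder names) would look like the following.

```sql
-- Illustration only: the repository performs this step in R.
-- Count how often each Read Code appears for one Safe Haven and keep the top 100;
-- the union of the four Safe Havens' lists gives the 180 unique Read Codes.
-- dbo.HarmonisedLab and SafeHaven are hypothetical placeholder names.
SELECT TOP (100)
    ReadCode,
    COUNT(*) AS TestCount
FROM dbo.HarmonisedLab
WHERE SafeHaven = 'HIC'   -- repeat per Safe Haven
GROUP BY ReadCode
ORDER BY COUNT(*) DESC;
```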

Running SQL scripts

Tables loaded into the database:

Table Name                      Description
dbo.DaSH                        DaSH all data
dbo.SCI_Store_20220802          Glasgow all data
dbo.Labs_Biochem                HIC biochemistry data
dbo.HaematologyRestructured     HIC haematology data
dbo.ImmunologyRestructured      HIC immunology data
dbo.VirologyRestructured        HIC virology data
dbo.Lothian                     Lothian all data
dbo.Lothian_TestCode2ReadCode   Lothian local code to Read Code mapping

Scripts should be run in the sequence indicated by the "a", "b" and "c" prefixes in their names. Scripts sharing the same prefix can be run in any order; for example, "a DaSH2FHIRscot.sql" can be run either before or after "a Glasgow2FHIRscot.sql".
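
Before running the scripts, it may be worth confirming that the source tables above have been loaded. One way to do this (a sketch, not part of the repository) is to query INFORMATION_SCHEMA:

```sql
-- Sketch only (not part of the repository): check that the expected source tables exist.
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME IN ('DaSH', 'SCI_Store_20220802', 'Labs_Biochem',
                     'HaematologyRestructured', 'ImmunologyRestructured',
                     'VirologyRestructured', 'Lothian', 'Lothian_TestCode2ReadCode');
```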

Running R code

Scripts should be run in the sequence indicated by the numeric prefixes "0" to "6" in their names. Scripts sharing the same prefix can be run in any order.

Acknowledgments

This research data covers only a subset of health board data, and no inference or implication should be made regarding the relative quality, accuracy, or performance of individual health boards.

We acknowledge the support of the Grampian Data Safe Haven (DaSH) facility within the Aberdeen Centre for Health Data Science and the associated financial support of the University of Aberdeen, and NHS Research Scotland (through NHS Grampian investment in DaSH). This work was supported by DataLoch (dataloch.org), which is core-funded by the Data-Driven Innovation programme within the Edinburgh and South East Scotland City Region Deal (ddi.ac.uk) and the Chief Scientist Office, Scottish Government. The authors would like to acknowledge the support of the DataLoch service for their involvement in obtaining approvals and provisioning access to data from the Lothian region. The authors would like to acknowledge the work of the West of Scotland Safe Haven team at NHS Greater Glasgow and Clyde in supporting the extractions and linkage to de-identified NHS patient datasets. We acknowledge the support of the Health Informatics Centre, University of Dundee for managing and supplying the anonymised data and NHS Tayside/Fife the original data source. This project was supported by funding from Research Data Scotland (RDS) through the Systems Development Fund.
