ScottishLabData - A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses
If you are using this software, please reference the following paper:
Gao, C., Mumtaz, S., McCall, S., O’Sullivan, K., McGilchrist, M., Morales, D.R., Hall, C., Wilde, K., Mayor, C., Linksted, P. and Harrison, K., 2025. A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses. Journal of Biomedical Informatics, p.104771. https://doi.org/10.1016/j.jbi.2024.104771
In Scotland, laboratory data is captured, stored and reported from local health board systems with significant heterogeneity. For researchers or other users, working on laboratory datasets across regional cohorts requires effort and time.
The over-arching requirement is to develop an open-source software pipeline to generate a harmonised laboratory dataset.
Some additional functional requirements are:
- Merging different data tables from the database into the same structure table.
- Standardising the censored results.
- Changing all test codes to read codes.
- Based on the test frequency, focusing on the top 100 recorded Read Codes of each Safe Haven (180 unique Read Codes).
- Standardising test unit, change test with the same code into having the same unit
- Technical validation of the harmonised and cleaned data by visualisation
Two non-functional requirements were also identified:
- Operate in an environment where compute resources may be limited.
- Handling the big scale of the data, utilising tools/methods for best performance
We completed the required data governance applications and extracted laboratory tests (biochemistry, haematology, immunology, virology, pathology and microbiology) from all registered SHARE participants who were residents from 01/01/2015 to 31/12/2021 (or until death) in the region of each of the four regional Safe Havens.
The data was provided as flat files from the Safe Havens to HIC and loaded into a Microsoft Structured Query Language (SQL) Server database (version 16.0.1) for ease of handling the large data volumes. Using WhiteRabbit, sampling one million rows in each flat file, we tested the data types, completeness (fraction empty) and consistency (N unique and fraction unique) of each Safe Haven’s laboratory results. We used the Rabbit-In-a-Hat software to read and display WhiteRabbit scan results. Rabbit-In-a-Hat allows a user to connect source data to tables and columns within the target data structure through a graphical user interface. Based on SQL skeleton auto-generated by Rabbit-In-a-Hat, further editing of the mapping was made based on consulting local Safe Havens to establish the final SQL script for structural harmonisation.
Data exploration and harmonisation were conducted using a mixture of SQL and R (version 4.1.3).
Repository sub-directories:
-
SQL code for table structure/ SQL code for harmonising the structure of data, standardising the censored results as well as changing all text codes to read codes
-
R code for data harmonisation/ R scripts to select the 180 codes, standardising the test unit as well as visualisation.
-
Paper Appendix/ Appendix for published paper related to this work
Tables been loaded into the database:
Tables Name | Description |
---|---|
dbo.DaSH | DaSH all data |
dbo.SCI_Store_20220802 | Glasgow all data |
dbo.Labs_Biochem | HIC biochemistry data |
dbo.HaematologyRestructured | HIC haematology data |
dbo.ImmunologyRestructured | HIC immunology data |
dbo.VirologyRestructured | HIC virology data |
dbo.Lothian | Lothian all data |
dbo.Lothian_TestCode2ReadCode | Lothian local code to read code mapping |
Scripts should be run in sequence as marked as "a", "b" and "c" in the name. It does not matter which one to run first between the same mark. For example, either run "a DaSH2FHIRscot.sql" before or after "a Glasgow2FHIRscot.sql" is OK.
Scripts should be run in sequence as marked as "0", "1", "2" to "6" in the name. It does not matter which one to run first between the same mark.
This is research data for only a subset of health board data and no inference or implication should be made regarding relative quality, accuracy, or performance of individual health boards.
We acknowledge the support of the Grampian Data Safe Haven (DaSH) facility within the Aberdeen Centre for Health Data Science and the associated financial support of the University of Aberdeen, and NHS Research Scotland (through NHS Grampian investment in DaSH). This work was supported by DataLoch (dataloch.org), which is core-funded by the Data-Driven Innovation programme within the Edinburgh and South East Scotland City Region Deal (ddi.ac.uk) and the Chief Scientist Office, Scottish Government. The authors would like to acknowledge the support of the DataLoch service for their involvement in obtaining approvals and provisioning access to data from the Lothian region. The authors would like to acknowledge the work of the West of Scotland Safe Haven team at NHS Greater Glasgow and Clyde in supporting the extractions and linkage to de-identified NHS patient datasets. We acknowledge the support of the Health Informatics Centre, University of Dundee for managing and supplying the anonymised data and NHS Tayside/Fife the original data source. This project was supported by funding from Research Data Scotland (RDS) through the Systems Development Fund.