Automatic data extraction from 24 hour blood pressure measurement reports of a large multicenter clinical trial.

Janis M Nolde; Ajmal Mian; Luca Schlaich; Justine Chan; Leslie Marisol Lugo-Gavidia; Nicola Barrie; Vishal Gopal; Graham S Hillis; Clara K Chow; Markus P Schlaich
Abstract
Ambulatory blood pressure monitoring (ABPM) is usually reported in descriptive values such as circadian averages and standard deviations. Making use of the original, individual blood pressure measurements may be advantageous, particularly for research purposes, as this increases the flexibility of the analytical process, enables alternative statistical analyses and provides novel insights. Here we describe the development of a new multistep, hierarchical data extraction algorithm to collect raw data from .pdf reports and text files as part of a large multicenter clinical study. Original reports were saved in a nested file system, from which they were automatically extracted, read and saved into databases with custom-made programs written in Python 3. Data were further processed and cleaned, and relevant descriptive statistics such as averages and standard deviations were calculated according to a variety of definitions of day- and night-time. Additionally, data control mechanisms for manual review of the data and programmatic auto-detection of extraction errors were implemented as part of the project. The developed algorithm extracted 97% of the data automatically; the missing data consisted mostly of reports that were saved incorrectly or not formatted in the specified way. Manual checks comparing samples of the extracted data to original reports indicated a high level of accuracy of the extracted data; no errors introduced by flaws in the extraction software were detected in the extracted dataset. The developed multistep, hierarchical data extraction algorithm facilitated collection from different file formats and, paired with database cleaning and data processing steps, led to an effective and accurate assembly of raw ABPM data for further and adjustable analyses. Manual work was minimized while data quality was ensured with standardized, reproducible procedures.
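The abstract notes that averages and standard deviations were calculated under a variety of day- and night-time definitions. The following is a minimal Python 3 sketch of that idea, not the authors' implementation. The function names (`is_daytime`, `summarize`), the clock-time boundaries and the sample readings are all hypothetical, and it assumes every reading outside the daytime window counts as night-time, which may differ from the definitions used in the study.

```python
from datetime import datetime, time
from statistics import mean, stdev


def is_daytime(clock, day_start, night_start):
    """Return True if `clock` falls in the half-open window [day_start, night_start)."""
    return day_start <= clock < night_start


def summarize(readings, day_start=time(6, 0), night_start=time(22, 0)):
    """Group (timestamp, value) readings into day/night and compute count, mean and SD.

    The default boundaries are illustrative assumptions, not values from the paper.
    """
    groups = {"day": [], "night": []}
    for timestamp, value in readings:
        key = "day" if is_daytime(timestamp.time(), day_start, night_start) else "night"
        groups[key].append(value)

    summary = {}
    for key, values in groups.items():
        summary[key] = {
            "n": len(values),
            "mean": mean(values) if values else None,
            # The sample standard deviation needs at least two readings.
            "sd": stdev(values) if len(values) > 1 else None,
        }
    return summary


# Hypothetical systolic readings (mmHg) from a single 24-hour recording.
readings = [
    (datetime(2021, 3, 1, 8, 0), 130),
    (datetime(2021, 3, 1, 14, 0), 126),
    (datetime(2021, 3, 1, 21, 30), 122),
    (datetime(2021, 3, 1, 23, 0), 112),
    (datetime(2021, 3, 2, 3, 0), 108),
]

# The same raw readings summarized under two different day/night definitions.
wide = summarize(readings)
narrow = summarize(readings, day_start=time(9, 0), night_start=time(21, 0))
```

Because the raw readings are kept, changing the day/night definition only means calling `summarize` again with different boundaries. This is the flexibility the abstract attributes to working with individual measurements rather than pre-computed report averages.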
Journal COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE
ISSN 1872-7565
Published 01 Feb 2022
Volume 214
Pages 106588
DOI 10.1016/j.cmpb.2021.106588
Type Journal Article | Multicenter Study