Algorithms and technologies are essential tools that pervade all aspects of our daily lives. In recent decades, health care research has benefited from new computer-based recruiting methods, federated architectures for data storage, innovative analyses of datasets, and so on. Nevertheless, health care datasets can still be affected by data bias. Biased datasets provide a distorted view of reality, leading to wrong analysis results and, consequently, wrong decisions. For example, in a clinical trial that studied the risk of cardiovascular diseases, predictions were wrong due to the lack of data on ethnic minorities. It is, therefore, of paramount importance for researchers to acknowledge the data bias that may be present in the datasets they use, to adopt techniques to mitigate it where needed, and to check whether and how analysis results are impacted. This paper proposes a method to address bias in datasets that: (i) defines the types of data bias that may be present in the dataset, (ii) characterizes and quantifies data bias with adequate metrics, and (iii) provides guidelines to identify, measure, and mitigate data bias for different data sources. The proposed method is applicable to both prospective and retrospective clinical trials. We evaluate our proposal through both theoretical considerations and interviews with researchers in the health care environment.