Synthetic data are becoming a critical tool for building artificially intelligent systems. Simulators provide a way of generating data systematically and at scale. These data can then be used either exclusively, or in conjunction with real data, for training and testing systems. Synthetic data are particularly attractive in cases where the availability of ``real'' training examples might be a bottleneck. While the volume of data in healthcare is growing exponentially, creating datasets that support novel tasks and/or reflect a diverse set of conditions and causal relationships is not trivial. Furthermore, these data are highly sensitive and often patient-specific. Recent research has begun to illustrate the potential for synthetic data in many areas of medicine, but no systematic review of the literature exists. In this paper, we present the cases for physical and statistical simulations for creating data and the proposed applications in healthcare and medicine. We discuss how, while synthetic data can promote privacy, equity, safety, and continual and causal learning, they also run the risk of introducing flaws and blind spots and of propagating or exaggerating biases.