There has been rapidly growing interest in Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems in the machine learning research literature, with the aim of assisting doctors in telemedicine services. These systems are designed to interact with patients, collect evidence about their symptoms and relevant antecedents, and possibly make predictions about the underlying diseases. Doctors would review the interactions, including the evidence and the predictions, and collect additional information from patients if necessary before deciding on next steps. Despite recent progress in this area, an important piece of doctors' interactions with patients is missing from the design of these systems, namely the differential diagnosis. Its absence is largely due to the lack of datasets that include such information for models to train on. In this work, we present a large-scale synthetic dataset of roughly 1.3 million patients that includes, for each patient, a differential diagnosis along with the ground truth pathology, symptoms, and antecedents. Unlike existing datasets, which contain only binary symptoms and antecedents, this dataset also contains categorical and multi-choice symptoms and antecedents, which are useful for efficient data collection. Moreover, some symptoms are organized in a hierarchy, making it possible to design systems that interact with patients in a logical way. As a proof of concept, we extend two existing AD and ASD systems to incorporate the differential diagnosis, and provide empirical evidence that using differentials as training signals is essential for the efficiency of such systems. The dataset is available at \href{https://figshare.com/articles/dataset/DDXPlus_Dataset/20043374}{https://figshare.com/articles/dataset/DDXPlus\_Dataset/20043374}.
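To make the three evidence types and the differential concrete, a patient record of the kind described above might be sketched as follows. This is a minimal illustration only, not the dataset's actual schema: the class `PatientRecord`, its field names, the representation of the differential as (pathology, probability) pairs, and all example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

# Hypothetical evidence value types (assumed, not the dataset's schema):
#   bool       -> binary symptom or antecedent
#   str        -> categorical (exactly one value chosen)
#   List[str]  -> multi-choice (several values may apply)
EvidenceValue = Union[bool, str, List[str]]


@dataclass
class PatientRecord:
    pathology: str  # ground truth pathology
    evidences: Dict[str, EvidenceValue] = field(default_factory=dict)
    # Differential diagnosis, assumed here to be (pathology, probability) pairs.
    differential: List[Tuple[str, float]] = field(default_factory=list)

    def ranked_differential(self) -> List[str]:
        """Return pathology names sorted from most to least likely."""
        ordered = sorted(self.differential, key=lambda pair: pair[1], reverse=True)
        return [name for name, _ in ordered]


# Illustrative values only; none are taken from the dataset.
record = PatientRecord(
    pathology="pathology_A",
    evidences={
        "fever": True,                            # binary
        "pain_intensity": "moderate",             # categorical
        "pain_location": ["chest", "left_arm"],   # multi-choice
    },
    differential=[("pathology_B", 0.3), ("pathology_A", 0.6), ("pathology_C", 0.1)],
)
```

Under these assumptions, a model trained on the differential would target the full ranked list returned by `record.ranked_differential()` rather than the single `pathology` label.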