MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation

Developing conversational agents that interact with patients and provide primary clinical advice has attracted increasing attention because of its large application potential, especially during the COVID-19 pandemic. However, the training of end-to-end neural medical dialogue systems is restricted by the insufficient quantity of medical dialogue corpora. In this work, we make the first attempt to build and release a large-scale, high-quality medical dialogue dataset covering 12 types of common gastrointestinal diseases, named MedDG, with more than 17K conversations collected from an online health consultation community. Five categories of entities (diseases, symptoms, attributes, tests, and medicines) are annotated in each conversation of MedDG as additional labels. To push forward future research on building expert-sensitive medical dialogue systems, we propose two medical dialogue tasks based on the MedDG dataset: next entity prediction and doctor response generation. To gain a clear understanding of these two tasks, we implement several state-of-the-art benchmark models and design two dialogue models that additionally take the predicted entities into account. Experimental results show that pre-trained language models and other baselines struggle on both tasks, performing poorly on our dataset, and that response quality can be improved with the help of auxiliary entity information. In human evaluation, a simple retrieval model outperforms several state-of-the-art generative models, indicating that substantial room for improvement remains in generating medically meaningful responses.
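To make the two tasks more concrete, the following is a minimal sketch of how an entity-annotated conversation could be represented and turned into next entity prediction examples. It is an illustration only: the field names (`speaker`, `text`, `entities`), the speaker labels, the toy dialogue, and the set-based F1 helper are all assumptions made here, not the released MedDG file format or the paper's evaluation protocol. Only the five entity categories come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# The five entity categories named in the abstract.
ENTITY_CATEGORIES = ("disease", "symptom", "attribute", "test", "medicine")

Entity = Tuple[str, str]  # (category, entity name)


@dataclass
class Utterance:
    """One turn of a consultation (hypothetical schema, not the MedDG format)."""
    speaker: str  # assumed labels: "patient" or "doctor"
    text: str
    entities: Dict[str, List[str]] = field(default_factory=dict)


def entity_set(utt: Utterance) -> Set[Entity]:
    """Flatten the per-category annotations of one utterance into a set."""
    return {
        (cat, name)
        for cat in ENTITY_CATEGORIES
        for name in utt.entities.get(cat, [])
    }


def next_entity_examples(dialogue: List[Utterance]) -> List[Tuple[List[Utterance], Set[Entity]]]:
    """Build one (history, target entities) pair per doctor turn that has a history."""
    examples = []
    for i, utt in enumerate(dialogue):
        if utt.speaker == "doctor" and i > 0:
            examples.append((dialogue[:i], entity_set(utt)))
    return examples


def set_f1(predicted: Set[Entity], gold: Set[Entity]) -> float:
    """Illustrative set-level F1 between predicted and gold entities."""
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented toy dialogue, for illustration only.
DIALOGUE = [
    Utterance("patient", "I have had stomach pain after meals for a week.",
              {"symptom": ["stomach pain"]}),
    Utterance("doctor", "Have you had a gastroscopy? It could be gastritis.",
              {"test": ["gastroscopy"], "disease": ["gastritis"]}),
    Utterance("patient", "No, not yet."),
    Utterance("doctor", "You can try omeprazole for now.",
              {"medicine": ["omeprazole"]}),
]

EXAMPLES = next_entity_examples(DIALOGUE)
```

In this framing, next entity prediction maps each dialogue history to the set of entities in the upcoming doctor turn, and a response generator could then condition on both the history and those predicted entities. Treating the target as an unordered set is a design choice of this sketch.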
