Traditional Chinese Medicine (TCM) is a natural, safe, and effective therapy that has spread and is practiced worldwide. The unique TCM diagnosis and treatment system requires a comprehensive analysis of a patient's symptoms, which are embedded in clinical records written in free text. Prior studies have shown that this system can be digitized and made intelligent with the aid of artificial intelligence (AI) technology, such as natural language processing (NLP). However, existing datasets have neither the quality nor the quantity needed to support the further development of data-driven AI technology in TCM. Therefore, in this paper, we focus on the core task of the TCM diagnosis and treatment system, syndrome differentiation (SD), and we introduce the first public large-scale dataset for SD, called TCM-SD. Our dataset contains 54,152 real-world clinical records covering 148 syndromes. Furthermore, we collect a large-scale unlabelled textual corpus in the field of TCM and propose a domain-specific pre-trained language model, called ZY-BERT. We conduct experiments using deep neural networks to establish a strong performance baseline, reveal various challenges in SD, and demonstrate the potential of domain-specific pre-trained language models. Our study and analysis reveal opportunities for incorporating computer science and linguistics knowledge to explore the empirical validity of TCM theories.