MCSCSet: A Specialist-annotated Dataset for Medical-domain Chinese Spelling Correction

From arXiv. This is the full version of the resource paper "MCSCSet: A Specialist-annotated Dataset for Medical-domain Chinese Spelling Correction", accepted at CIKM 2022. (https://dl.acm.org/doi/10.1145/3511808.3557636)

Chinese Spelling Correction (CSC) is gaining increasing attention due to its promise of automatically detecting and correcting spelling errors in Chinese texts. Despite its extensive use in many applications, such as search engines and optical character recognition systems, it has been little explored in medical scenarios, where complex and uncommon medical entities are easily misspelled. Correcting misspelled medical entities is arguably more difficult than correcting open-domain misspellings because it requires domain-specific knowledge. In this work, we define the task of Medical-domain Chinese Spelling Correction and propose MCSCSet, a large-scale specialist-annotated dataset that contains about 200k samples. In contrast to existing open-domain CSC datasets, MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, and ii) corresponding misspelled sentences manually annotated by medical specialists. To support automated dataset curation, MCSCSet further offers a medical confusion set consisting of the commonly misspelled characters of given Chinese medical terms, which makes it possible to create a medical misspelling dataset automatically. Extensive empirical studies show significant performance gaps between open-domain and medical-domain spelling correction, highlighting the need for high-quality datasets that support Chinese spelling correction in specific domains. Moreover, our work benchmarks several representative Chinese spelling correction models, establishing baselines for future work.
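To make the role of a confusion set concrete, the following is a minimal sketch of how one might generate a misspelled sentence from a correct one. It is not the authors' pipeline: the `CONFUSION_SET` entries, the `make_misspelling` helper, and the example query are all hypothetical and are not taken from MCSCSet.

```python
import random

# Hypothetical confusion set: each character maps to characters it could
# plausibly be confused with. These entries are invented for illustration
# and are NOT taken from MCSCSet.
CONFUSION_SET = {
    "糖": ["塘", "唐"],
    "尿": ["脲"],
    "病": ["并"],
}


def make_misspelling(sentence, term, confusion_set, rng):
    """Replace one character of `term` inside `sentence` with a confusable one.

    Returns (corrupted_sentence, position). Returns (sentence, None) when the
    term is absent or none of its characters has a confusion entry.
    """
    start = sentence.find(term)
    if start == -1:
        return sentence, None
    # Positions inside the term whose character has at least one substitute.
    candidates = [
        start + i for i, ch in enumerate(term) if confusion_set.get(ch)
    ]
    if not candidates:
        return sentence, None
    pos = rng.choice(candidates)
    wrong = rng.choice(confusion_set[sentence[pos]])
    return sentence[:pos] + wrong + sentence[pos + 1:], pos


rng = random.Random(0)  # fixed seed so the sketch is reproducible
correct = "糖尿病患者饮食注意什么"  # hypothetical medical query
corrupted, pos = make_misspelling(correct, "糖尿病", CONFUSION_SET, rng)
print(correct, "->", corrupted, "at index", pos)
```

Each (corrupted, correct) pair produced this way has the same length, which matches the character-level substitution setting that CSC models are usually trained on; the returned position can serve as a detection label.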

