While pre-trained Chinese language models have demonstrated impressive performance on a wide range of NLP tasks, the Chinese Spell Checking (CSC) task remains a challenge. Previous research has explored using information such as glyphs and phonetics to improve the ability to distinguish misspelled characters, with good results. However, the generalization ability of these models is not well understood: it is unclear whether they incorporate glyph-phonetic information and, if so, whether this information is fully utilized. In this paper, we aim to better understand the role of glyph-phonetic information in the CSC task and suggest directions for improvement. Additionally, we propose a new, more challenging, and practical setting for testing the generalizability of CSC models. All code is made publicly available.