In this paper, we propose a hybrid text normalization system using multi-head self-attention. The system combines the advantages of a rule-based model and a neural model for text preprocessing tasks. Previous studies of Mandarin text normalization typically rely on a set of hand-written rules, which are hard to generalize beyond the cases they cover. Our proposed system is motivated by the neural models of recent studies and achieves better performance on our internal news corpus. This paper also describes several attempts to handle the imbalanced pattern distribution of the dataset. Overall, the system improves sentence-level accuracy by over 1.5%, with potential for further gains.
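For context on the attention mechanism named above, the following is a minimal NumPy sketch of multi-head self-attention. It is purely illustrative, not the paper's implementation: the projection matrices `W_q`, `W_k`, `W_v`, and `W_o` are random stand-ins for parameters that a trained model would learn, and the dimensions are arbitrary.

```python
import numpy as np

def multi_head_self_attention(x, num_heads, rng=None):
    """Multi-head self-attention over a sequence x of shape (seq_len, d_model).

    Weights are drawn randomly here for illustration only; a trained model
    would learn W_q, W_k, W_v, and W_o.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    # Random projection matrices (stand-ins for learned parameters).
    W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * d_model**-0.5
                          for _ in range(4))

    def split_heads(m):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    heads = weights @ v                              # (num_heads, seq_len, d_head)

    # Concatenate the heads and apply the output projection.
    out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ W_o
    return out, weights

# Example: 5 tokens, model width 16, 4 heads.
x = np.random.default_rng(1).standard_normal((5, 16))
out, attn = multi_head_self_attention(x, num_heads=4)
print(out.shape, attn.shape)  # (5, 16) (4, 5, 5)
```

Each head attends over the full token sequence with its own projections, so different heads can specialize in different normalization cues (e.g. digits vs. surrounding context) before their outputs are recombined.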