This paper presents a self-supervised learning method for pointer-generator networks to improve spoken-text normalization. Spoken-text normalization, which converts spoken-style text into style-normalized text, is becoming an important technology for improving subsequent processing such as machine translation and summarization. The most successful spoken-text normalization method to date is sequence-to-sequence (seq2seq) mapping using pointer-generator networks, which possess a copy mechanism over the input sequence. However, these models require a large amount of paired spoken-style and style-normalized text, and it is difficult to prepare such a volume of data. To construct a spoken-text normalization model from limited paired data, we focus on self-supervised learning, which can utilize unpaired text data to improve seq2seq models. Unfortunately, conventional self-supervised learning methods do not assume that pointer-generator networks are utilized. Therefore, we propose a novel self-supervised learning method, MAsked Pointer-Generator Network (MAPGN). The proposed method can effectively pre-train the pointer-generator network by learning to fill masked tokens using the copy mechanism. Our experiments demonstrate that MAPGN is more effective for pointer-generator networks than conventional self-supervised learning methods on two spoken-text normalization tasks.
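To make the copy mechanism concrete, the following is a minimal toy sketch (not the paper's implementation; all names and values are illustrative) of how a pointer-generator network forms its output distribution: a generation probability `p_gen` mixes a softmax over the vocabulary with a copy distribution obtained by scattering attention weights onto the vocabulary ids of the input tokens, so tokens from the spoken input can be copied directly into the normalized output.

```python
import numpy as np

def pointer_generator_dist(p_vocab, attention, src_token_ids, vocab_size, p_gen):
    """Mix generation and copy distributions (illustrative sketch).

    p_vocab:       (vocab_size,) softmax over the output vocabulary
    attention:     (src_len,) attention weights over source positions
    src_token_ids: (src_len,) vocabulary ids of the source tokens
    p_gen:         scalar in [0, 1], probability of generating vs. copying
    """
    p_copy = np.zeros(vocab_size)
    # Scatter-add attention mass onto the vocabulary ids of the source tokens,
    # so a token that appears in the input can be "copied" to the output.
    np.add.at(p_copy, src_token_ids, attention)
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy

# Toy example: a vocabulary of 6 tokens, source sequence with ids [4, 2, 4].
vocab_size = 6
p_vocab = np.full(vocab_size, 1.0 / vocab_size)  # uniform generator for simplicity
attention = np.array([0.5, 0.2, 0.3])            # attends mostly to token id 4
src_ids = np.array([4, 2, 4])

p_final = pointer_generator_dist(p_vocab, attention, src_ids, vocab_size, p_gen=0.4)
# p_final still sums to 1, and token 4 gains probability mass from the
# copy term because the attention concentrates on it.
```

In MAPGN-style pre-training, masked tokens in the input would be predicted under this mixed distribution, encouraging the model to exploit the copy path even before any paired normalization data is seen.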