Attention-based recurrent neural encoder-decoder models present an elegant solution to the automatic speech recognition problem. This approach folds the acoustic model, pronunciation model, and language model into a single network and requires only a parallel corpus of speech and text for training. However, unlike in conventional approaches that combine separate acoustic and language models, it is not clear how to use additional (unpaired) text. While there has been previous work on methods addressing this problem, a thorough comparison among methods is still lacking. In this paper, we compare a suite of past methods and some of our own proposed methods for using unpaired text data to improve encoder-decoder models. For evaluation, we use the medium-sized Switchboard data set and the large-scale Google voice search and dictation data sets. Our results confirm the benefits of using unpaired text across a range of methods and data sets. Surprisingly, for first-pass decoding, the rather simple approach of shallow fusion performs best across data sets. However, for Google data sets we find that cold fusion has a lower oracle error rate and outperforms other approaches after second-pass rescoring on the Google voice search data set.
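As an illustration of the shallow fusion approach named above, the following is a minimal sketch. It assumes the commonly used formulation in which a language model trained separately on unpaired text is combined with the encoder-decoder model only at decoding time, by adding a weighted language-model log-probability to the decoder's log-probability for each candidate token. The function names, toy vocabulary, scores, and interpolation weight are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def shallow_fusion_step(asr_logits, lm_logits, lm_weight=0.3):
    """Fuse per-token scores for one decoding step.

    Returns log p_asr(y | x, history) + lm_weight * log p_lm(y | history)
    for every token y in the vocabulary. The weight is a tunable
    hyperparameter; 0.3 here is an arbitrary illustrative default.
    """
    asr_logp = log_softmax(asr_logits)
    lm_logp = log_softmax(lm_logits)
    return [a + lm_weight * l for a, l in zip(asr_logp, lm_logp)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy example with made-up scores (assumption, for illustration only).
vocab = ["two", "too", "to"]
asr_logits = [1.0, 1.1, 0.9]  # acoustically near-ambiguous homophones
lm_logits = [0.2, 0.1, 2.0]   # external LM prefers "to" in this context

asr_only_best = vocab[argmax(asr_logits)]
fused = shallow_fusion_step(asr_logits, lm_logits, lm_weight=0.5)
fused_best = vocab[argmax(fused)]
```

In a beam search, the fused score would be accumulated over the tokens of each hypothesis in place of the decoder score alone. Neither model is retrained in this scheme, which is what distinguishes it from approaches such as cold fusion, where the language model is integrated during training.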