Attention-based encoder-decoder architectures, such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation, and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we showed that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear whether they would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a multi-head attention architecture, which offers improvements over the commonly used single-head attention. On the optimization side, we explore techniques such as synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500-hour voice search task, we find that the proposed changes improve the word error rate (WER) of the LAS system from 9.2% to 5.6%, while the best conventional system achieves 6.7% WER. We also test both models on a dictation dataset, where our model provides 4.1% WER while the conventional system provides 5% WER.
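The results above are all stated as word error rate (WER). As a reading aid, the following is a minimal sketch of how WER is conventionally computed (word-level edit distance divided by the number of reference words) and of how the reported absolute figures translate into relative reductions. The function names `word_error_rate` and `relative_reduction` and the sample utterance are illustrative assumptions; this is not the scoring tool used for the reported results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative helper (not from the paper): whitespace tokenization,
    case-sensitive matching, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline` (illustrative helper)."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical utterance: one substitution and one deletion in four words.
    print(word_error_rate("play some jazz music", "play sum jazz"))

    # WER figures taken from the abstract, in percent.
    print(round(relative_reduction(9.2, 5.6), 3))  # LAS before vs. after
    print(round(relative_reduction(6.7, 5.6), 3))  # conventional vs. LAS, voice search
    print(round(relative_reduction(5.0, 4.1), 3))  # conventional vs. LAS, dictation
```

Under this arithmetic, the improved LAS model reduces WER by roughly 39% relative to its own baseline, by roughly 16% relative to the conventional system on voice search, and by 18% relative to the conventional system on dictation.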