End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively with conventional models. To further improve quality, a two-pass model has been proposed that rescores streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining reasonable latency. That model attends to acoustics when rescoring hypotheses, in contrast to a class of neural correction models that use only first-pass text hypotheses. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder extracts context information from the first-pass hypotheses. The proposed deliberation model achieves a 12% relative word error rate (WER) reduction compared to LAS rescoring on Google Voice Search (VS) tasks, and a 23% relative reduction on a proper noun test set. Compared to a large conventional model, our best model performs 21% better in relative terms on VS. In terms of computational complexity, the deliberation decoder is larger than the LAS decoder and hence requires more computation in second-pass decoding.
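The core idea of attending to both acoustics and first-pass hypotheses can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names (`attend`, `deliberation_context`), the use of plain dot-product attention, the toy vectors, and the choice to concatenate the two context vectors are hypothetical and are not taken from the abstract, which does not specify the attention mechanism or how the two sources are combined. The bidirectional encoder over the hypotheses is not modeled here; its outputs are simply given as input.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    """Dot-product attention (an assumed mechanism): returns the
    attention-weighted average of the vectors in `memory`."""
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(w * vec[i] for w, vec in zip(weights, memory)) for i in range(dim)]

def deliberation_context(query, acoustic_enc, hypothesis_enc):
    """Attend separately to acoustic encoder outputs and to encoded
    first-pass hypotheses, then concatenate the two context vectors.
    Concatenation is an illustrative assumption."""
    acoustic_ctx = attend(query, acoustic_enc)
    hypothesis_ctx = attend(query, hypothesis_enc)
    return acoustic_ctx + hypothesis_ctx  # list concatenation

# Toy inputs (made up): one decoder query, three acoustic frames,
# and two encoded first-pass hypothesis tokens, all of dimension 2.
query = [1.0, 0.0]
acoustic_enc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hypothesis_enc = [[0.2, 0.8], [0.9, 0.1]]

context = deliberation_context(query, acoustic_enc, hypothesis_enc)
print(len(context))
```

In this sketch the second-pass decoder would consume `context` at each decoding step. A LAS-style rescorer corresponds to using only the acoustic half, which gives one way to see why the deliberation decoder is larger than the LAS decoder.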