Streaming-style inference for encoder-decoder automatic speech recognition (ASR) systems is important for reducing latency, which is essential for interactive use cases. To this end, we propose a novel blockwise synchronous decoding algorithm with a hybrid approach that combines endpoint prediction and endpoint post-determination. In endpoint prediction, we use the CTC posterior to compute the expected number of tokens that are yet to be emitted from the encoder features of the current blocks. Based on this expectation, the decoder predicts the endpoint to realize continuous block synchronization, like a running stitch. Meanwhile, endpoint post-determination probabilistically detects backward jumps of the source-target attention, which are caused by mispredicted endpoints. It then discards the affected hypotheses and resumes decoding, like a back stitch. We combine these methods into a hybrid approach, named run-and-back stitch search, which reduces both computational cost and latency. Evaluations on various ASR tasks demonstrate the efficiency of the proposed decoding algorithm: for instance, on the Librispeech test set it reduces latency at the 90th percentile from 1487 ms to 821 ms while maintaining high recognition accuracy.
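The endpoint-prediction idea can be illustrated with a minimal sketch. It is not necessarily the formula used in this work: it assumes that frame labels are drawn independently from the CTC posterior, in which case the expected length of the collapsed label sequence is the sum over frames t and non-blank labels k of p_t(k) (1 - p_{t-1}(k)). The function names, the blank index of 0, and the threshold of 0.5 are hypothetical choices made for illustration.

```python
from typing import Optional, Sequence


def expected_token_count(
    posteriors: Sequence[Sequence[float]], blank: int = 0
) -> float:
    """Expected number of tokens after CTC collapsing.

    Assumption (for illustration only): each frame's label is sampled
    independently from its posterior. A token starts at frame t when the
    label is a non-blank k and the previous frame's label is not k.
    """
    expected = 0.0
    prev: Optional[Sequence[float]] = None
    for frame in posteriors:
        for k, p in enumerate(frame):
            if k == blank:
                continue
            p_prev = prev[k] if prev is not None else 0.0
            expected += p * (1.0 - p_prev)
        prev = frame
    return expected


def endpoint_predicted(
    posteriors: Sequence[Sequence[float]],
    num_emitted: int,
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
    blank: int = 0,
) -> bool:
    """Return True when few tokens are expected to remain in the current
    blocks, i.e. the decoder should stop and wait for the next block."""
    remaining = expected_token_count(posteriors, blank) - num_emitted
    return remaining < threshold


# Toy posterior over {blank, "a", "b"} for five encoder frames.
toy = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(expected_token_count(toy))
print(endpoint_predicted(toy, num_emitted=0))
print(endpoint_predicted(toy, num_emitted=2))
```

In this toy case the posterior is one-hot, so the expectation equals the length of the collapsed sequence "a b". Once the decoder has emitted both tokens, the sketch signals an endpoint for the current blocks.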