Recently, attention-based encoder-decoder (AED) end-to-end (E2E) models have attracted increasing attention in the field of automatic speech recognition (ASR). AED models, however, still have drawbacks when deployed in commercial applications. Autoregressive beam-search decoding makes them inefficient for high-concurrency applications, and integrating external word-level language models is inconvenient. Most importantly, the global attention mechanism makes AED models ill-suited to streaming recognition. In this paper, we propose a novel framework, named WNARS, which uses hybrid CTC-attention AED models and weighted finite-state transducers (WFST) to solve these problems together. We switch from autoregressive beam search to decoding from the CTC branch, which performs first-pass decoding with the WFST in a chunk-wise streaming fashion. The decoder branch then performs second-pass rescoring on the generated hypotheses non-autoregressively. On the AISHELL-1 task, WNARS achieves a character error rate of 5.22% with 640 ms latency, which, to the best of our knowledge, is state-of-the-art performance for online ASR. Further experiments on our 10,000-hour Mandarin task show that the proposed method achieves more than 20% improvement with 50% of the latency compared to a strong TDNN-BLSTM lattice-free MMI baseline.
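To make the two-pass scheme concrete, the following is a minimal Python sketch of the second-pass rescoring step. The names `Hypothesis`, `rescore`, `attention_score`, and the interpolation `weight` are illustrative assumptions, not the paper's actual implementation; the first pass (chunk-wise CTC decoding over the WFST) is assumed to have already produced scored hypotheses.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hypothesis:
    tokens: List[int]   # token ids from first-pass CTC + WFST decoding
    ctc_score: float    # first-pass log-probability

def rescore(hyps: List[Hypothesis],
            attention_score: Callable[[List[int]], float],
            weight: float = 0.5) -> Hypothesis:
    """Second pass: interpolate first-pass scores with attention-decoder
    scores and return the best hypothesis."""
    def combined(h: Hypothesis) -> float:
        # A single teacher-forced forward pass of the decoder scores the
        # whole hypothesis at once, so the second pass is non-autoregressive.
        return weight * h.ctc_score + (1.0 - weight) * attention_score(h.tokens)
    return max(hyps, key=combined)

# Usage with a dummy scorer standing in for the attention decoder:
hyps = [Hypothesis([5, 12, 7], -3.2), Hypothesis([5, 9, 7], -3.5)]
best = rescore(hyps, attention_score=lambda toks: -0.1 * len(toks))
```

Because each hypothesis is scored in one forward pass rather than token by token, the second pass adds only a small, fixed cost on top of the streaming first pass.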