There is often a trade-off between recognition performance and latency in streaming automatic speech recognition (ASR). Traditional methods, such as look-ahead and chunk-based methods, usually require information from future frames to improve recognition accuracy, which incurs unavoidable latency even when computation is fast enough. A causal model that computes without any future frames avoids this latency, but its performance is significantly worse than that of traditional methods. In this paper, we propose revision strategies to improve the causal model. First, we introduce a real-time encoder-states revision strategy that modifies previous states: encoder forward computation starts as soon as data is received, and the previous encoder states are revised after several more frames arrive, so there is no need to wait for any right context. Furthermore, we design a CTC spike-position alignment decoding algorithm to reduce the time cost introduced by the revision strategy. All experiments are conducted on the LibriSpeech dataset. Fine-tuning the CTC-based wav2vec2.0 model, our best method achieves 3.7/9.2 WER on the test-clean/test-other sets, which is competitive with chunk-based methods and knowledge distillation methods.
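To make the two ideas above concrete, the following is a minimal, runnable NumPy sketch of the emit-then-revise control flow and of restarting greedy CTC decoding from the last spike before the revised region. Everything in it (the toy one-layer causal encoder, the `REVISE_AFTER` window, the way the revision folds in newer frames, the projection to logits) is an illustrative assumption for exposition, not the paper's implementation.

```python
# Schematic sketch: emit causal encoder states immediately, revise them once a
# few newer frames have arrived, and restart CTC decoding from the last spike
# before the revised region. Toy model throughout; all names are assumptions.
import numpy as np

rng = np.random.default_rng(0)
DIM, VOCAB, BLANK = 8, 5, 0    # toy sizes; BLANK is the CTC blank id
REVISE_AFTER = 3               # revise a frame once 3 newer frames exist (assumption)

W_in = rng.normal(scale=0.3, size=(DIM, DIM))
W_rec = rng.normal(scale=0.3, size=(DIM, DIM))
W_out = rng.normal(scale=0.3, size=(DIM, VOCAB))

def causal_step(x_t, h_prev):
    """One strictly causal encoder step: current frame and past state only."""
    return np.tanh(x_t @ W_in + h_prev @ W_rec)

def revised_step(frames, t, h_prev):
    """Stand-in for the revision pass: re-run the step for frame t, folding in
    the REVISE_AFTER newer frames that have arrived since the first pass."""
    ctx = frames[t] + 0.5 * np.mean(frames[t + 1 : t + 1 + REVISE_AFTER], axis=0)
    return np.tanh(ctx @ W_in + h_prev @ W_rec)

def collapse(labels, prev=BLANK):
    """Greedy CTC collapse (merge repeats, drop blanks) from a given context,
    so a suffix can be re-decoded without touching the prefix."""
    out = []
    for i in labels:
        if i != BLANK and i != prev:
            out.append(i)
        prev = i
    return out

frames, states, labels = [], [], []
for _ in range(16):                         # 16 frames arriving one at a time
    frames.append(rng.normal(size=DIM))
    h_prev = states[-1] if states else np.zeros(DIM)
    states.append(causal_step(frames[-1], h_prev))
    labels.append(int((states[-1] @ W_out).argmax()))
    hyp = collapse(labels)                  # provisional output: zero look-ahead wait

    t = len(frames) - 1 - REVISE_AFTER      # oldest frame that can now be revised
    if t >= 0:
        h_prev = states[t - 1] if t > 0 else np.zeros(DIM)
        states[t] = revised_step(frames, t, h_prev)
        labels[t] = int((states[t] @ W_out).argmax())
        for u in range(t + 1, len(states)): # propagate the revision forward
            states[u] = causal_step(frames[u], states[u - 1])
            labels[u] = int((states[u] @ W_out).argmax())
        # Spike-position alignment: restart decoding at the last non-blank
        # (spike) frame before the revised region, reusing the decoded prefix
        # instead of re-decoding from frame 0.
        s = next((u for u in range(t - 1, -1, -1) if labels[u] != BLANK), 0)
        prev = labels[s - 1] if s > 0 else BLANK
        hyp = collapse(labels[:s]) + collapse(labels[s:], prev)

print(hyp)
```

With greedy decoding the saving from the spike-aligned restart is modest, but the same restart point would let a beam-search decoder reuse its prefix hypotheses instead of re-searching from the first frame, which is the kind of time cost the alignment algorithm is meant to reduce.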