Automatic speech recognition (ASR) models make fewer errors when more surrounding speech is available as context. Unfortunately, acquiring a larger future context leads to higher latency, so there is an inevitable trade-off between speed and accuracy. A naive way to meet different latency requirements is to store multiple models and pick the one that best fits the constraints at hand. A more desirable approach is to have a single model that can dynamically adjust its latency under different constraints, which we refer to as Multi-mode ASR. A Multi-mode ASR model can fulfill various latency requirements during inference: when a larger latency is acceptable, the model can process a longer future context to achieve higher accuracy; when the latency budget is tight, the model relies less on future context while still achieving reliable accuracy. In pursuit of Multi-mode ASR, we propose Stochastic Future Context, a simple training procedure that samples one streaming configuration in each iteration. Through extensive experiments on the AISHELL-1 and LibriSpeech datasets, we show that a Multi-mode ASR model rivals, if not surpasses, a set of competitive streaming baselines trained with different latency budgets.
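To make the training procedure concrete, the following is a minimal sketch of a Stochastic Future Context training loop in PyTorch-style Python. The candidate future-context sizes, the model interface (a `future_context` argument to `forward`), and the CTC loss are illustrative assumptions for the sketch, not the paper's exact implementation.

```python
import random

# Hypothetical set of streaming configurations: how many future frames
# the encoder may attend to. The values below are placeholders.
FUTURE_CONTEXT_FRAMES = [0, 4, 8, 16]

def train_epoch(model, loader, optimizer, ctc_loss):
    """One epoch of Stochastic Future Context training (sketch)."""
    model.train()
    for speech, speech_lens, targets, target_lens in loader:
        # Sample ONE streaming configuration (future-context size) per
        # iteration, so a single model learns to operate under every
        # latency budget it may face at inference time.
        future_context = random.choice(FUTURE_CONTEXT_FRAMES)

        # Assumed model signature: the sampled future context restricts
        # how far ahead the encoder can look for this minibatch.
        log_probs, out_lens = model(
            speech, speech_lens, future_context=future_context
        )
        loss = ctc_loss(
            log_probs.transpose(0, 1), targets, out_lens, target_lens
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

At inference time, the same model is simply run with whichever future-context size the latency budget allows, with no retraining or model switching.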