There is growing interest in unifying the streaming and full-context automatic speech recognition (ASR) networks into a single end-to-end ASR model, to simplify model training and deployment for both use cases. In real-world ASR applications, however, streaming ASR models typically operate under tighter storage and computational constraints (e.g., on embedded devices) than server-side full-context models. Motivated by recent progress in Omni-sparsity supernet training, where multiple subnetworks are jointly optimized within a single model, this work aims to jointly learn a compact sparse on-device streaming ASR model and a large dense server non-streaming model in a single supernet. We further show that performing supernet training during both wav2vec 2.0 self-supervised learning and supervised ASR fine-tuning not only substantially improves the large non-streaming model, as shown in prior work, but also improves the compact sparse streaming model.
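To make the joint objective concrete, below is a minimal sketch of one supernet training step, assuming a toy PyTorch encoder whose self-attention switches between full-context (non-streaming) and causal (streaming) modes, with magnitude-based pruning masks standing in for the sparse subnetwork. The class and function names, the CTC objective, and the 70% sparsity level are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ToyDualModeEncoder(nn.Module):
    """Tiny encoder: one attention layer plus a feed-forward layer.
    A causal attention mask gives streaming mode; optional weight
    masks give the sparse subnetwork of the supernet."""
    def __init__(self, dim=64, vocab=32):
        super().__init__()
        self.qkv = nn.Parameter(torch.randn(dim, 3 * dim) * 0.02)
        self.ffn = nn.Parameter(torch.randn(dim, dim) * 0.02)
        self.out = nn.Parameter(torch.randn(dim, vocab) * 0.02)

    def forward(self, x, streaming: bool, masks=None):
        def w(p, name):  # multiply in pruning masks during the forward pass
            return p * masks[name] if masks else p
        q, k, v = (x @ w(self.qkv, 'qkv')).chunk(3, dim=-1)
        att = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        if streaming:  # restrict each frame to past context only
            T = x.size(1)
            causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
            att = att.masked_fill(causal, float('-inf'))
        h = torch.relu((att.softmax(-1) @ v) @ w(self.ffn, 'ffn'))
        return h @ w(self.out, 'out')

def magnitude_masks(model, sparsity=0.7):
    """Per-parameter magnitude pruning masks (hypothetical criterion)."""
    masks = {}
    for name, p in model.named_parameters():
        k = max(1, int(p.numel() * sparsity))
        thresh = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = (p.detach().abs() > thresh).float()
    return masks

model = ToyDualModeEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ctc = nn.CTCLoss(blank=0)
x = torch.randn(8, 50, 64)                     # toy acoustic features (B, T, dim)
y = torch.randint(1, 32, (8, 20))              # toy label sequences
in_len = torch.full((8,), 50, dtype=torch.long)
tgt_len = torch.full((8,), 20, dtype=torch.long)

# One supernet step: the dense full-context pass and the sparse
# streaming pass share the same underlying weights; summing the two
# losses optimizes both subnetworks jointly.
opt.zero_grad()
dense = model(x, streaming=False).log_softmax(-1).transpose(0, 1)
sparse = model(x, streaming=True, masks=magnitude_masks(model)).log_softmax(-1).transpose(0, 1)
loss = ctc(dense, y, in_len, tgt_len) + ctc(sparse, y, in_len, tgt_len)
loss.backward()
opt.step()
```

Because the masks are applied multiplicatively in the forward pass rather than permanently zeroing the weights, the dense model stays intact while gradients from the sparse streaming pass still flow to the surviving weights, which is the essential mechanism that lets one set of parameters serve both deployment targets.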