Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent of the utterance in the form of embeddings or posteriors, and these are used as auxiliary inputs for RNN-T training and inference. Experimenting with a 50k-hour far-field English speech corpus, this study shows that in non-streaming mode, where the intent representation is extracted from the entire utterance and then used to bias the streaming RNN-T search from the start, the approach provides a 5.56% relative word error rate reduction (WERR). On the other hand, a streaming system using per-frame intent posteriors as extra inputs to the RNN-T ASR system yields a 3.33% relative WERR. A further detailed analysis of the streaming system indicates that our proposed method yields especially strong gains on media-playing-related intents (e.g. 9.12% relative WERR on PlayMusicIntent).
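The core mechanism described above is feeding an intent representation to the ASR model as an auxiliary input alongside the acoustic features. A minimal sketch of one plausible realization, simple feature concatenation, is shown below; the function name, shapes, and the concatenation scheme are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def add_intent_features(frames, intent_rep, streaming=False):
    """Attach an intent representation to each acoustic frame.

    frames:     (T, D) array of per-frame acoustic features.
    intent_rep: either a (K,) utterance-level intent embedding
                (non-streaming case) or a (T, K) array of per-frame
                intent posteriors (streaming case).
    Returns a (T, D + K) array of augmented input features.
    """
    T = frames.shape[0]
    if streaming:
        # Streaming: per-frame intent posteriors, one row per frame.
        assert intent_rep.shape[0] == T, "need one posterior row per frame"
        aux = intent_rep
    else:
        # Non-streaming: broadcast the single utterance-level
        # embedding to every frame so it biases the whole search.
        aux = np.tile(intent_rep[None, :], (T, 1))
    return np.concatenate([frames, aux], axis=1)
```

In a real system the augmented features would be consumed by the RNN-T encoder (or joint network); this sketch only shows how the two auxiliary-input variants discussed in the abstract differ in shape and timing.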