We consider the problem of recognizing speech utterances spoken to a device that is simultaneously generating a known sound waveform; for example, recognizing queries issued to a digital assistant while it plays back responses to previous user inputs. Previous work has proposed building acoustic echo cancellation (AEC) models for this task that optimize speech enhancement metrics, using both neural network and signal processing approaches. Since our goal is to recognize the input speech, we instead consider enhancements that improve word error rates (WERs) when the predicted speech signal is passed to an automatic speech recognition (ASR) model. First, we augment the loss function with a term that encourages outputs useful to a pre-trained ASR model, and show that this augmented loss function improves WER metrics. Second, we demonstrate that augmenting our training dataset of real-world examples with a large synthetic dataset improves performance. Crucially, applying SpecAugment-style masks to the reference channel during training helps the model adapt from the synthetic to the real domain. In experimental evaluations, we find the proposed approaches improve performance, on average, by 57% over a signal processing baseline and 45% over the neural AEC model without the proposed changes.
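The SpecAugment-style masking of the reference channel mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and all mask-width parameters are hypothetical, and the exact masking policy used in the paper is not specified in the abstract.

```python
import numpy as np

def mask_reference_channel(spec, num_time_masks=2, num_freq_masks=2,
                           max_time_width=10, max_freq_width=8, rng=None):
    """Apply SpecAugment-style time and frequency masks to a
    (frames x bins) reference-channel spectrogram.

    All parameter values here are illustrative assumptions; the
    abstract only states that SpecAugment-style masks are applied
    to the reference channel during training.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    frames, bins = out.shape
    # Zero out randomly placed blocks of consecutive time frames.
    for _ in range(num_time_masks):
        w = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, max(1, frames - w + 1)))
        out[t0:t0 + w, :] = 0.0
    # Zero out randomly placed blocks of consecutive frequency bins.
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, max(1, bins - w + 1)))
        out[:, f0:f0 + w] = 0.0
    return out
```

Masking the reference (playback) channel in this way prevents the model from over-relying on a perfectly clean reference, which plausibly narrows the gap between synthetic training mixtures and real recordings.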