Acoustic Echo Cancellation (AEC) is essential for accurate recognition of queries spoken to a smart speaker that is playing out audio. Previous work has shown that a neural AEC model operating on log-mel spectral features (denoted "logmel" hereafter) can greatly improve Automatic Speech Recognition (ASR) accuracy when optimized with an auxiliary loss utilizing a pre-trained ASR model encoder. In this paper, we develop a conformer-based waveform-domain neural AEC model inspired by the "TasNet" architecture. The model is trained by jointly optimizing Negative Scale-Invariant SNR (SISNR) and ASR losses on a large speech dataset. On a realistic rerecorded test set, we find that cascading a linear adaptive AEC and a waveform-domain neural AEC is very effective, giving 56-59% word error rate (WER) reduction over the linear AEC alone. On this test set, the 1.6M parameter waveform-domain neural AEC also improves over a larger 6.5M parameter logmel-domain neural AEC model by 20-29% in easy to moderate conditions. By operating on smaller frames, the waveform neural model is able to perform better at smaller sizes and is better suited for applications where memory is limited.
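The negative SISNR training objective mentioned above can be illustrated with a minimal sketch. It follows the commonly used definition of scale-invariant SNR: project the estimate onto the target, then compare the energy of the projection with the energy of the residual. The abstract does not give the exact formulation, so the mean removal, the `eps` stabilizer, the function name `neg_si_snr`, and the toy signals are all assumptions made for illustration. The ASR loss term and its weighting are not shown.

```python
import math


def neg_si_snr(estimate, target, eps=1e-8):
    """Return the negative scale-invariant SNR (in dB) of `estimate` w.r.t. `target`.

    Both inputs are equal-length sequences of floats (waveform samples).
    Lower return values mean a better estimate, so this can serve as a loss.
    """
    if len(estimate) != len(target) or len(target) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    n = len(target)

    # Assumption: remove the mean of each signal, a common SI-SNR convention.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target: s_target = (<est, tgt> / ||tgt||^2) * tgt.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt)
    scale = dot / (tgt_energy + eps)
    s_target = [scale * t for t in tgt]

    # Whatever the projection does not explain is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]

    target_power = sum(s * s for s in s_target)
    noise_power = sum(v * v for v in e_noise)
    si_snr_db = 10.0 * math.log10((target_power + eps) / (noise_power + eps))
    return -si_snr_db


# Toy signals (hypothetical): a sinusoidal "clean" target plus residual interference.
target = [math.sin(0.1 * i) for i in range(200)]
noisy = [t + 0.1 * math.cos(0.37 * i) for i, t in enumerate(target)]
cleaner = [t + 0.01 * math.cos(0.37 * i) for i, t in enumerate(target)]

loss_noisy = neg_si_snr(noisy, target)
loss_cleaner = neg_si_snr(cleaner, target)
# Rescaling the estimate should leave the loss essentially unchanged.
loss_scaled = neg_si_snr([3.0 * x for x in noisy], target)
```

In this sketch, the estimate with less residual interference yields a lower loss, and multiplying the estimate by a constant gain leaves the loss unchanged up to the small effect of `eps`, which is the property that makes the measure scale-invariant.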