In this paper, we propose Textual Echo Cancellation (TEC), a framework for cancelling the text-to-speech (TTS) playback echo from overlapping speech recordings. Such a system can substantially improve speech recognition performance and user experience on intelligent devices such as smart speakers, because the user can talk to the device while it is still playing the TTS response to the previous query. We implement this system with a novel sequence-to-sequence model with multi-source attention that takes both the microphone mixture signal and the source text of the TTS playback as inputs and predicts the enhanced audio. Experiments show that the textual information of the TTS playback is critical to enhancement performance. In addition, the text sequence is much smaller than the raw acoustic signal of the TTS playback and can be transmitted to the device or ASR server immediately, even before the playback is synthesized. Our proposed approach therefore reduces network traffic and latency compared with alternative approaches such as acoustic echo cancellation (AEC).
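To give a rough sense of the size argument above, the following sketch compares the payload of a TTS reply sent as text with the same reply sent as uncompressed audio. It is an illustration only and not taken from the paper: the 16 kHz, 16-bit, mono PCM format, the example sentence, the 4-second playback duration, and the helper names `pcm_bytes`, `text_bytes`, and `payload_ratio` are all assumptions.

```python
def pcm_bytes(duration_s: float, sample_rate_hz: int = 16_000,
              bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size of uncompressed PCM audio (assumed format: 16 kHz, 16-bit, mono)."""
    num_samples = int(round(duration_s * sample_rate_hz))
    return num_samples * bytes_per_sample * channels


def text_bytes(text: str) -> int:
    """Size of the TTS source text when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def payload_ratio(text: str, duration_s: float) -> float:
    """How many times larger the raw audio payload is than the text payload."""
    size = text_bytes(text)
    if size == 0:
        raise ValueError("text must not be empty")
    return pcm_bytes(duration_s) / size


if __name__ == "__main__":
    # Hypothetical reply and playback duration, chosen only for illustration.
    reply = "The weather today is cloudy with a chance of rain."
    duration_s = 4.0
    print("text payload (bytes): ", text_bytes(reply))
    print("audio payload (bytes):", pcm_bytes(duration_s))
    print("audio / text ratio:   ", round(payload_ratio(reply, duration_s)))
```

Under these assumptions the audio payload is several orders of magnitude larger than the text. A compressed audio codec would narrow the gap, but the text would still be the smaller payload and, unlike the audio, is available before synthesis starts.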