For real-time speech enhancement (SE), including noise suppression, dereverberation, and acoustic echo cancellation, the time-variance of audio signals poses a severe challenge. Because of causality and memory constraints, the system can use only historical information to capture time-variant characteristics. We propose to adaptively change the receptive field according to the input signal in a deep neural network based SE model. Specifically, in an encoder-decoder framework, a dynamic attention span mechanism is introduced into all the attention modules to control the amount of historical content used for processing the current frame. Experimental results verify that this dynamic mechanism can better track time-variant factors and capture speech-related characteristics, benefiting both interference removal and speech quality preservation.
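The idea of a dynamic attention span can be illustrated with a minimal NumPy sketch. The abstract does not specify how the span is computed or applied, so everything below is an assumption: the soft linear-ramp mask is borrowed from the general adaptive attention span technique rather than taken from the proposed model, and the names `predict_span`, `dynamic_span_attention`, `max_span`, and `ramp` are hypothetical. The sketch shows single-head causal attention in which each frame attends only to itself and a limited, input-dependent number of past frames.

```python
import numpy as np


def predict_span(x, w, b, max_span):
    """Hypothetical span predictor: maps each frame to a span in (0, max_span).

    x: (T, D) frame features; w: (D,) weights; b: scalar bias.
    This is an assumed form, not the predictor used in the paper.
    """
    return max_span / (1.0 + np.exp(-(x @ w + b)))


def dynamic_span_attention(q, k, v, span, ramp=2.0):
    """Causal single-head attention with a per-frame soft span mask.

    q, k, v: (T, D) arrays. span: (T,) non-negative number of past frames
    that each query frame may attend to. ramp: width (in frames) of the
    linear fall-off from weight 1 to weight 0 at the edge of the span.
    """
    T, D = q.shape
    scores = q @ k.T / np.sqrt(D)                       # (T, T)

    t = np.arange(T)
    dist = t[:, None] - t[None, :]                      # dist[i, j] = i - j
    causal = dist >= 0                                  # no access to future frames

    # Soft mask: 1 inside the span, then a linear ramp down to 0.
    soft = np.clip((ramp + span[:, None] - dist) / ramp, 0.0, 1.0)

    # Exclude future frames, then apply a numerically stable masked softmax.
    scores = np.where(causal, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores) * soft
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


# Usage: random features, spans predicted from the query frames.
rng = np.random.default_rng(0)
T, D = 6, 4
q, k, v = rng.normal(size=(3, T, D))
span = predict_span(q, w=rng.normal(size=D), b=0.0, max_span=4.0)
out, weights = dynamic_span_attention(q, k, v, span)
```

With a span of zero and `ramp=1.0`, each frame attends only to itself, and larger spans let a frame draw on more history. In a streaming setting, the span would also bound how many past keys and values need to be kept in memory, which is the constraint the abstract points to.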