Echo and noise suppression is an integral part of any full-duplex communication system. Many recent acoustic echo cancellation (AEC) systems rely on a separate adaptive filtering module for linear echo suppression and a neural module for residual echo suppression. However, adaptive filtering modules require convergence and remain susceptible to changes in the acoustic environment, and this two-stage framework often introduces unnecessary delay into the AEC system when neural modules are already capable of both linear and nonlinear echo suppression. In this paper, we exploit the offset-compensating ability of complex time-frequency masks and propose an end-to-end complex-valued neural network architecture. The building block of the proposed model is a pseudocomplex extension of the densely connected multidilated DenseNet (D3Net) building block, resulting in a very small network of only 354K parameters. The architecture uses the multi-resolution nature of the D3Net building blocks to eliminate the need for pooling, allowing the network to extract features with large receptive fields without any loss of output resolution. We also propose a dual-mask technique for joint echo and noise suppression with simultaneous speech enhancement. Evaluation on both synthetic and real test sets demonstrates promising results across multiple energy-based metrics and perceptual proxies.
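The offset-compensating property of complex time-frequency masks mentioned above can be illustrated with a small NumPy sketch. It is not the proposed network: it only shows, under an idealized assumption of a circular delay within a single frame, that a unit-magnitude complex mask can realign a signal while a real-valued (magnitude-only) mask cannot. The function names `delay_mask` and `apply_complex_mask`, the frame length, and the delay value are hypothetical choices made for illustration.

```python
import numpy as np

def apply_complex_mask(spec, mask):
    """Apply a time-frequency mask by element-wise complex multiplication."""
    return spec * mask

def delay_mask(n_fft, delay):
    """Unit-magnitude complex mask equivalent to a circular delay of `delay` samples.

    By the DFT shift theorem, delaying a length-n_fft frame by `delay` samples
    (circularly) multiplies bin k by exp(-2j*pi*k*delay/n_fft).
    """
    k = np.arange(n_fft)
    return np.exp(-2j * np.pi * k * delay / n_fft)

# Hypothetical frame and offset, chosen only for the demonstration.
rng = np.random.default_rng(0)
n_fft, delay = 64, 5
frame = rng.standard_normal(n_fft)
spec = np.fft.fft(frame)

# A complex mask rotates the phase of each bin and so shifts the frame in time.
mask = delay_mask(n_fft, delay)
shifted = np.fft.ifft(apply_complex_mask(spec, mask)).real
target = np.roll(frame, delay)

# A real-valued mask with the same magnitude (all ones) leaves the frame unshifted.
real_only = np.fft.ifft(apply_complex_mask(spec, np.abs(mask))).real

print(np.allclose(shifted, target))    # complex mask compensates the offset
print(np.allclose(real_only, target))  # magnitude-only mask does not
```

In a real STFT pipeline the frames are windowed and overlapped, so a delay is only approximately a per-bin phase rotation; the sketch ignores this and should be read as intuition for why complex masks can absorb small misalignments between the reference and the microphone signal.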