Insights into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

The key advantage of using multiple microphones for speech enhancement is that spatial filtering can be used to complement the tempo-spectral processing. In a traditional setting, linear spatial filtering (beamforming) and single-channel post-filtering are commonly performed separately. In contrast, there is a trend towards employing deep neural networks (DNNs) to learn a joint spatial and tempo-spectral non-linear filter, which means that the restriction of a linear processing model and that of a separate processing of spatial and tempo-spectral information can potentially be overcome. However, the internal mechanisms that lead to good performance of such data-driven filters for multi-channel speech enhancement are not well understood. Therefore, in this work, we analyse the properties of a non-linear spatial filter realized by a DNN as well as its interdependency with temporal and spectral processing by carefully controlling the information sources (spatial, spectral, and temporal) available to the network. We confirm the superiority of a non-linear spatial processing model, which outperforms an oracle linear spatial filter in a challenging speaker extraction scenario for a low number of microphones by 0.24 POLQA score. Our analyses reveal that in particular spectral information should be processed jointly with spatial information as this increases the spatial selectivity of the filter. Our systematic evaluation then leads to a simple network architecture that outperforms state-of-the-art network architectures on a speaker extraction task by 0.22 POLQA score and by 0.32 POLQA score on the CHiME3 data.
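For readers unfamiliar with the traditional two-stage setting the abstract contrasts against, the following is a minimal sketch of a linear spatial filter followed by a separate single-channel post-filter, for one frequency bin. It is not the paper's method or its oracle baseline: the choice of an MVDR beamformer with a Wiener-like gain, and the names `mvdr_weights` and `beamform_then_postfilter`, are assumptions made purely for illustration, and the noise covariance and steering vector are assumed to be known.

```python
import numpy as np


def mvdr_weights(noise_cov, steering):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d) for a single frequency bin.

    noise_cov: (channels, channels) Hermitian noise covariance (assumed known).
    steering:  (channels,) steering vector of the target (assumed known).
    """
    rinv_d = np.linalg.solve(noise_cov, steering)
    return rinv_d / (steering.conj() @ rinv_d)


def beamform_then_postfilter(stft_bin, noise_cov, steering):
    """Two separate stages: linear spatial filter, then a single-channel gain.

    stft_bin: (channels, frames) complex STFT coefficients of one frequency bin.
    Returns a (frames,) complex array with the enhanced coefficients.
    """
    w = mvdr_weights(noise_cov, steering)

    # Stage 1 (spatial, linear): weighted sum over channels, w^H y per frame.
    beamformed = w.conj() @ stft_bin

    # Residual noise power at the beamformer output, w^H R w.
    noise_out = np.real(w.conj() @ noise_cov @ w)

    # Stage 2 (single-channel): Wiener-like gain in [0, 1] per frame,
    # computed only from the beamformer output, with no spatial information.
    power = np.abs(beamformed) ** 2
    gain = np.maximum(1.0 - noise_out / np.maximum(power, 1e-12), 0.0)
    return gain * beamformed
```

In this sketch the spatial stage is constrained to a linear combination of the channels, and the post-filter never sees the multi-channel signal. These are the two restrictions that, according to the abstract, a jointly learned non-linear filter can potentially overcome.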
