In conventional multichannel audio signal enhancement, spatial and spectral filtering are often performed sequentially. In contrast, for neural spatial filtering it has been shown that joint spectro-spatial filtering is more beneficial. In this contribution, we investigate the influence of the training target on the spatial selectivity of such a time-varying spectro-spatial filter. We extend the recently proposed complex-valued spatial autoencoder (COSPA) for target speaker extraction by leveraging its interpretable structure and purposefully informing the network of the target speaker's position. The resulting informed COSPA (iCOSPA) thus employs a multichannel complex-valued neural network architecture capable of jointly processing spatial and spectral information, making it an effective neural spatial filtering method. We train iCOSPA with several training targets that enforce different amounts of spatial processing and analyze the network's spatial filtering capacity. We find that the proposed architecture is indeed capable of learning different spatial selectivity patterns to attain the different training targets.