In conventional multichannel audio signal enhancement, spatial and spectral filtering are often performed sequentially. In contrast, it has been shown that for neural spatial filtering a joint spectro-spatial approach is more beneficial. In this contribution, we investigate the spatial filtering performed by such a time-varying spectro-spatial filter. We extend the recently proposed complex-valued spatial autoencoder (COSPA) to the task of target speaker extraction by leveraging its interpretable structure and purposefully informing the network of the target speaker's position. We show that the resulting informed COSPA (iCOSPA) effectively and flexibly extracts a target speaker from a mixture of speakers. We also find that the proposed architecture is capable of learning pronounced spatial selectivity patterns, and we show that the results depend significantly on the training target and on the reference signal used when computing the various evaluation metrics.
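To make the notion of a time-varying spatial filter and its spatial selectivity pattern concrete, the following is a minimal sketch and not the COSPA or iCOSPA implementation. It assumes a uniform linear array under a far-field model, an STFT-domain filter-and-sum operation with one complex weight per microphone, frame, and frequency bin, and hand-set delay-and-sum weights in place of weights estimated by a network. All function names and parameters (`apply_spatial_filter`, `steering_vector`, `beampattern`, the array spacing, the speed of sound) are hypothetical choices for illustration.

```python
import numpy as np


def apply_spatial_filter(X, W):
    """Filter-and-sum in the STFT domain.

    X: complex array of shape (mics, frames, freqs), the microphone signals.
    W: complex array of the same shape, one weight per mic, frame, and bin,
       so the filter may vary over time and frequency.
    Returns Y of shape (frames, freqs) with Y = sum_m conj(W_m) * X_m.
    """
    return np.sum(np.conj(W) * X, axis=0)


def steering_vector(angle_rad, n_mics, spacing_m, freq_hz, c=343.0):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    positions = np.arange(n_mics) * spacing_m
    delays = positions * np.cos(angle_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)


def beampattern(w, n_mics, spacing_m, freq_hz, angles_rad):
    """Spatial selectivity |w^H d(theta)| of weights w at one frequency."""
    return np.array([
        abs(np.vdot(w, steering_vector(a, n_mics, spacing_m, freq_hz)))
        for a in angles_rad
    ])


if __name__ == "__main__":
    n_mics, spacing, freq = 4, 0.05, 1000.0
    target = np.pi / 3  # assumed target direction in radians

    # Delay-and-sum weights steered at the target; a neural filter would
    # instead estimate such weights per frame and frequency bin.
    w = steering_vector(target, n_mics, spacing, freq) / n_mics

    angles = np.linspace(0.0, np.pi, 7)
    print(np.round(beampattern(w, n_mics, spacing, freq, angles), 3))
```

Evaluating `beampattern` on the weights used in a given frame and frequency bin is one simple way to inspect how strongly a filter favors the target direction over other directions.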